Week 7 Monday#
Announcements#
In-class quiz tomorrow, based on the worksheets due tonight.
Worksheet 13 is posted, on Logistic Regression (the topic of Prof. Zhou’s lecture on Friday and the lectures this week).
Not expected to know from Friday: gradient descent for finding the coefficients \(\beta\), and Matplotlib.
Next week’s quiz will be based on Worksheets 13 and 14 and this week’s lectures. (We will review the most relevant parts of Prof. Zhou’s lecture.)
Plan for today: Approximately 15 minutes of lecture, then 10-15 minutes to work (Maya and Hanson are here to answer questions), then more lecture.
import pandas as pd
import numpy as np
import altair as alt
import seaborn as sns
Predicting if a penguin is in the Chinstrap species#
Today we will use Logistic Regression with the flipper length and bill length columns in the penguins dataset to predict if a penguin is in the Chinstrap species.
df = sns.load_dataset("penguins").dropna(axis=0)
Here are the columns we’re going to use. I didn’t want to keep typing these repeatedly, so I stored them in a length-two list.
cols = ["flipper_length_mm", "bill_length_mm"]
Here is what the true data looks like. I chose two input variables because we can visualize this easily. With three or more input variables, I wouldn’t be able to draw a chart like the following showing all the information.
When you look at the following, you should be thinking that we will be inputting the flipper length and the bill length, and outputting a prediction for whether a penguin belongs to the Chinstrap species.
alt.Chart(df).mark_circle().encode(
    x=alt.X(cols[0], scale=alt.Scale(zero=False)),
    y=alt.Y(cols[1], scale=alt.Scale(zero=False)),
    color="species"
)
The procedure is basically the same as for Linear Regression.
First we import.
from sklearn.linear_model import LogisticRegression
Then we instantiate. The convention is to name this object clf, for “classifier”, since we will be performing classification, not regression.
# clf instead of reg because it's for classification
clf = LogisticRegression()
Here are the values we will be predicting. Logistic regression is easiest to understand (especially the formula is easiest to understand) when we are predicting a binary variable (meaning a variable with only two options). We will see later an example of predicting the “species” variable itself, which in this case has three options.
df["isChinstrap"] = (df["species"] == "Chinstrap")
Notice how there is a new Boolean column named “isChinstrap” which indicates whether or not each penguin is a Chinstrap penguin. I’m using sample here rather than head because the penguins at the beginning of the dataset are all the same species.
df.sample(10, random_state=1)
|  | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | isChinstrap |
|---|---|---|---|---|---|---|---|---|
| 65 | Adelie | Biscoe | 41.6 | 18.0 | 192.0 | 3950.0 | Male | False |
| 276 | Gentoo | Biscoe | 43.8 | 13.9 | 208.0 | 4300.0 | Female | False |
| 186 | Chinstrap | Dream | 49.7 | 18.6 | 195.0 | 3600.0 | Male | True |
| 198 | Chinstrap | Dream | 50.1 | 17.9 | 190.0 | 3400.0 | Female | True |
| 293 | Gentoo | Biscoe | 46.5 | 14.8 | 217.0 | 5200.0 | Female | False |
| 183 | Chinstrap | Dream | 54.2 | 20.8 | 201.0 | 4300.0 | Male | True |
| 98 | Adelie | Dream | 33.1 | 16.1 | 178.0 | 2900.0 | Female | False |
| 193 | Chinstrap | Dream | 46.2 | 17.5 | 187.0 | 3650.0 | Female | True |
| 95 | Adelie | Dream | 40.8 | 18.9 | 208.0 | 4300.0 | Male | False |
| 195 | Chinstrap | Dream | 45.5 | 17.0 | 196.0 | 3500.0 | Female | True |
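As an optional aside, if we want to know how many penguins fall into each of the two categories, we could call value_counts on this new Boolean column (an extra cell, not needed for what follows).

# count how many penguins are / are not Chinstrap (optional aside)
df["isChinstrap"].value_counts()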
We now do the fit step. Notice how we are using two input features, and we are using our new Boolean “isChinstrap” column for the target.
clf.fit(df[cols], df["isChinstrap"])
LogisticRegression()
Here is a sense of what the output looks like. It seems pretty accurate, because in the real dataset the Chinstrap penguins are also grouped together in one part of the dataset, and that is where the block of True predictions appears.
clf.predict(df[cols])
array([False, False, False, False, False, False, False, False, False,
False, False, False, False, False, True, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, True, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, True, False, False,
False, False, False, False, False, False, False, False, True,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, True, True, True, True, True, False, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, False, True, False, True,
True, True, True, True, True, True, False, True, True,
False, True, True, True, False, True, True, True, True,
True, True, False, True, True, True, True, True, True,
True, True, True, False, True, True, True, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, True, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, True, False, False, False, False,
False, False, False, True, False, False, False, False, False,
False, False, False, False, False, False, False, False, False])
We are putting exactly those values above into a new column named “pred”.
df["pred"] = clf.predict(df[cols])
Here we can compare the actual values of “isChinstrap” to the values of “pred”.
df.sample(10, random_state=1)
|  | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | isChinstrap | pred |
|---|---|---|---|---|---|---|---|---|---|
| 65 | Adelie | Biscoe | 41.6 | 18.0 | 192.0 | 3950.0 | Male | False | False |
| 276 | Gentoo | Biscoe | 43.8 | 13.9 | 208.0 | 4300.0 | Female | False | False |
| 186 | Chinstrap | Dream | 49.7 | 18.6 | 195.0 | 3600.0 | Male | True | True |
| 198 | Chinstrap | Dream | 50.1 | 17.9 | 190.0 | 3400.0 | Female | True | True |
| 293 | Gentoo | Biscoe | 46.5 | 14.8 | 217.0 | 5200.0 | Female | False | False |
| 183 | Chinstrap | Dream | 54.2 | 20.8 | 201.0 | 4300.0 | Male | True | True |
| 98 | Adelie | Dream | 33.1 | 16.1 | 178.0 | 2900.0 | Female | False | False |
| 193 | Chinstrap | Dream | 46.2 | 17.5 | 187.0 | 3650.0 | Female | True | True |
| 95 | Adelie | Dream | 40.8 | 18.9 | 208.0 | 4300.0 | Male | False | False |
| 195 | Chinstrap | Dream | 45.5 | 17.0 | 196.0 | 3500.0 | Female | True | False |
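As an optional aside, one way to summarize these comparisons for the whole DataFrame at once is pd.crosstab, which counts how often each combination of actual value and predicted value occurs; the off-diagonal entries are the mistakes.

# cross-tabulate actual vs predicted values (optional aside)
pd.crosstab(df["isChinstrap"], df["pred"])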
Let’s see overall how accurate our predictions are. We first make a Boolean Series indicating whether the prediction was correct.
df["pred"] == df["isChinstrap"]
0 True
1 True
2 True
4 True
5 True
...
338 True
340 True
341 True
342 True
343 True
Length: 333, dtype: bool
A simple way to find the proportion of correct predictions is to call the mean method of the above Boolean Series. (Remember that True is treated as 1 and False is treated as 0.)
(df["pred"] == df["isChinstrap"]).mean()
0.954954954954955
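As a tiny illustration of that convention (an optional extra cell), the mean of a small Boolean Series is just the fraction of True values.

# True counts as 1 and False counts as 0, so the mean is the fraction of True values
pd.Series([True, True, False, False]).mean()
0.5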
There is also a way to use clf directly to find the proportion of correct predictions, by calling its score method. We pass as inputs to score the input features along with the desired outputs. Notice that we get the same number as above. (Reminder: the reason both regression and classification are considered to be supervised machine learning is that we imagine some “supervisor” has given us the desired outputs for at least some of the data.)
clf.score(df[cols], df["isChinstrap"])
0.954954954954955
Now let’s see a systematic way to find all the rows of df where the prediction is incorrect. We start with a Boolean Series which has True for the rows where the prediction is incorrect. (Another way to create this same pandas Series would be to use ~(df["pred"] == df["isChinstrap"]), where ~ negates the Boolean Series.)
df["pred"] != df["isChinstrap"]
0 False
1 False
2 False
4 False
5 False
...
338 False
340 False
341 False
342 False
343 False
Length: 333, dtype: bool
Now we can use Boolean indexing like usual. The following is the sub-DataFrame containing all the rows where the prediction is wrong.
df[df["pred"] != df["isChinstrap"]]
|  | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | isChinstrap | pred |
|---|---|---|---|---|---|---|---|---|---|
| 19 | Adelie | Torgersen | 46.0 | 21.5 | 194.0 | 4200.0 | Male | False | True |
| 37 | Adelie | Dream | 42.2 | 18.5 | 180.0 | 3550.0 | Female | False | True |
| 111 | Adelie | Biscoe | 45.6 | 20.3 | 191.0 | 4600.0 | Male | False | True |
| 122 | Adelie | Torgersen | 40.2 | 17.0 | 176.0 | 3450.0 | Female | False | True |
| 157 | Chinstrap | Dream | 45.2 | 17.8 | 198.0 | 3950.0 | Female | True | False |
| 182 | Chinstrap | Dream | 40.9 | 16.6 | 187.0 | 3200.0 | Female | True | False |
| 184 | Chinstrap | Dream | 42.5 | 16.7 | 187.0 | 3350.0 | Female | True | False |
| 192 | Chinstrap | Dream | 49.0 | 19.5 | 210.0 | 3950.0 | Male | True | False |
| 195 | Chinstrap | Dream | 45.5 | 17.0 | 196.0 | 3500.0 | Female | True | False |
| 199 | Chinstrap | Dream | 49.0 | 19.6 | 212.0 | 4300.0 | Male | True | False |
| 206 | Chinstrap | Dream | 42.5 | 17.3 | 187.0 | 3350.0 | Female | True | False |
| 216 | Chinstrap | Dream | 43.5 | 18.1 | 202.0 | 3400.0 | Female | True | False |
| 253 | Gentoo | Biscoe | 59.6 | 17.0 | 230.0 | 6050.0 | Male | False | True |
| 318 | Gentoo | Biscoe | 48.4 | 14.4 | 203.0 | 4625.0 | Female | False | True |
| 327 | Gentoo | Biscoe | 53.4 | 15.8 | 219.0 | 5500.0 | Male | False | True |
There are 15 rows above. If we want to know what proportion of predictions are incorrect, then we can divide by the length of the DataFrame. The following should be \(1-x\) where \(x\) is the score value we computed above.
# proportion wrong
15/len(df)
0.04504504504504504
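As a quick check (an optional extra cell), we can get the same proportion directly from the score method used above; this should again be approximately 0.045.

# proportion wrong, computed from the accuracy score (optional check)
1 - clf.score(df[cols], df["isChinstrap"])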
If all we care about are the row labels where the prediction is wrong, we can use the same Boolean indexing on df.index.
df.index[df["pred"] != df["isChinstrap"]]
Int64Index([19, 37, 111, 122, 157, 182, 184, 192, 195, 199, 206, 216, 253, 318,
327],
dtype='int64')
Let’s look at just three inputs, two of which have the correct output, but the input from the row with label 19 does not.
df.loc[[18, 19, 20], cols]
|  | flipper_length_mm | bill_length_mm |
|---|---|---|
| 18 | 184.0 | 34.4 |
| 19 | 194.0 | 46.0 |
| 20 | 174.0 | 37.8 |
Here are the predicted outputs. Remember that the middle one is wrong (all three should be False, meaning none of these three penguins is a Chinstrap).
clf.predict(df.loc[[18, 19, 20], cols])
array([False, True, False])
A great feature of logistic regression is that it indicates not just a predicted output, but also a confidence. We can get these confidences by using the predict_proba method of the fitted LogisticRegression object.
Here are the corresponding probabilities for the three inputs above.
You should interpret the following as saying our logistic regression model is 99.977% confident in its False prediction for the row with label 18, but is only 67% confident in its prediction for the row with label 19. This is good, because it was correct for the row with label 18 but was incorrect for the row with label 19.
We will explain two cells below how we know which number is for False and which number is for True.
# predicted probabilities
clf.predict_proba(df.loc[[18, 19, 20], cols])
array([[9.99771697e-01, 2.28303114e-04],
[3.29559670e-01, 6.70440330e-01],
[7.71818762e-01, 2.28181238e-01]])
The column at index 0 corresponds to False predictions and the column at index 1 corresponds to True predictions. I don’t think there is any way to know that in advance, but we can check it using the classes_ attribute.
clf.classes_
array([False, True])
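Since predict chooses whichever class has the larger estimated probability, we could reproduce the predictions for our three rows from the probabilities ourselves (an optional sketch, reusing the predict_proba values from above); this returns the same array([False, True, False]) we saw from clf.predict.

# predict corresponds to taking the class with the larger estimated probability
probs = clf.predict_proba(df.loc[[18, 19, 20], cols])
clf.classes_[probs.argmax(axis=1)]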
The whole point of logistic regression is to find parameters for estimating probabilities. That is also the difficult part (it happens when we call fit).
In our specific case, we have two input variables, so we need to find two coefficients and one intercept (or bias), for three parameters total.
Here are the coefficients. They are in the same order as the variables in cols.
clf.coef_
array([[-0.34802208, 1.08405225]])
Here is the intercept.
clf.intercept_
array([18.36005773])
Here is a reminder of which input feature came first.
cols[0]
'flipper_length_mm'
Let \(x_1\) denote the flipper length and let \(x_2\) denote the bill length. We are plugging approximately \(18.36 - 0.35 \cdot x_1 + 1.08 \cdot x_2\) into the logistic function \(1/(1 + e^{-x})\). The result will be a probability estimate for the penguin being a Chinstrap penguin.
Let’s see this formula in action for the row with label 19, where we know our classifier makes a mistake.
df.loc[19, cols]
flipper_length_mm 194.0
bill_length_mm 46.0
Name: 19, dtype: object
When we plug in the values above, we get the following. Here we use np.exp(x) to compute \(e^x\). I was surprised that this is not 0.67.
# I expected 0.67, will fix this on Wednesday
1/(1+np.exp(-(18.36+-0.35*194+1.08*46)))
0.5349429451582182
Added after class: My mistake was that I rounded too much. (I found it very difficult to catch this mistake.) Let’s try with one more decimal place and see that we really do get 0.67, as when we called clf.predict_proba.
1/(1+np.exp(-(18.36+-0.348*194+1.084*46)))
0.670842935992734
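If we wanted to avoid rounding entirely, we could plug in the stored parameters directly (an optional sketch using the coef_ and intercept_ attributes from above); this should reproduce the predict_proba value of about 0.6704.

# same computation, using the fitted parameters at full precision
x1, x2 = df.loc[19, cols]
1/(1 + np.exp(-(clf.intercept_[0] + clf.coef_[0, 0]*x1 + clf.coef_[0, 1]*x2)))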