Decision boundary for logistic regression

Announcements

  • I have notecards at the front.

  • Videos and video quizzes due tomorrow (Tuesday). Mostly review of overfitting.

  • Homework due tomorrow, Tuesday 11:59pm.

  • No in-class quiz tomorrow. Yasmeen will go over the sample midterm. (The sample midterm is posted on the Week 8 page on Canvas.)

  • Midterm is Thursday.

  • Only one in-class quiz left, Tuesday of Week 10.

Predicting if a penguin is in the Chinstrap species

The idea of logistic regression is to model the probability that a data point belongs to a certain class.
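Concretely, for input features \(x_1, \dots, x_n\), the model takes the form

\[ P(\text{class}) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n)}} \]

where the intercept \(\beta_0\) and the coefficients \(\beta_1, \dots, \beta_n\) are learned from the data. We will see this same formula below with our fitted intercept and coefficients plugged in.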

  • (Same as Friday last week.) Using the penguins dataset from Seaborn, fit a logistic regression model to classify whether or not a penguin is in the Chinstrap species, using its flipper length and its bill length.

  • Using the model, describe all flipper lengths and bill lengths for which our model thinks there is an 80% chance the penguin is Chinstrap. Give your answer as a formula for bill length in terms of flipper length.

import pandas as pd
import altair as alt
import seaborn as sns
df = sns.load_dataset("penguins").dropna()
cols = ["flipper_length_mm", "bill_length_mm"]
alt.Chart(df).mark_circle().encode(
    x=alt.X("flipper_length_mm", scale=alt.Scale(zero=False)),
    y=alt.Y("bill_length_mm", scale=alt.Scale(zero=False)),
    color="species"
)

The coefficients in logistic regression are easiest to interpret when we are dealing with a binary classification problem (i.e., when there are only two classes). Our two classes will be True (corresponding to “Chinstrap”) and False (corresponding to “not Chinstrap”).

df["is_Chinstrap"] = (df["species"]=="Chinstrap")
df.head()
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex is_Chinstrap
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 Male False
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 Female False
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 Female False
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 Female False
5 Adelie Torgersen 39.3 20.6 190.0 3650.0 Male False

We now perform logistic regression, using only two features from the dataset for our inputs, and using the “is_Chinstrap” column for the target (output).

from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
cols
['flipper_length_mm', 'bill_length_mm']

Recall that logistic regression is used for classification problems. One difference between classification problems and clustering problems is that, in classification problems, we provide sample outputs (also called labels or targets). That is reflected in the fit call below: alongside the inputs, we supply an output y, in this case df["is_Chinstrap"]. When calling the fit method on a clustering object, only inputs are supplied, no outputs; a clustering fit is sketched just below for contrast.
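For contrast, here is what a clustering fit looks like (a sketch using KMeans, which is just an assumed example here and is not used elsewhere in these notes):

from sklearn.cluster import KMeans

# Clustering: fit receives only the inputs; no labels/targets are supplied
KMeans(n_clusters=3).fit(df[cols])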

clf.fit(df[cols], df["is_Chinstrap"])
LogisticRegression()
intercept = clf.intercept_
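As an aside (not in the original lecture), scikit-learn records the class labels it saw during fitting in the classes_ attribute; for our Boolean target the order is [False, True]. This ordering will matter if we later ask for predicted probabilities.

clf.classes_
array([False,  True])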

Here we see the two coefficients, but which coefficient corresponds to which variable?

clf.coef_
array([[-0.34802208,  1.08405225]])

The coefficients are in the same order as the columns we used.

cols
['flipper_length_mm', 'bill_length_mm']
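If we would rather not rely on remembering the order, we can pair each column name with its coefficient explicitly (this line is an addition, not part of the original lecture):

dict(zip(cols, clf.coef_[0]))  # {'flipper_length_mm': -0.348..., 'bill_length_mm': 1.084...}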

So the first coefficient is the flipper length coefficient. The following attempted unpacking does not work; notice the extra set of square brackets on the outside when we displayed clf.coef_ above.

fcoef, bcoef = clf.coef_
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_31032/3736759329.py in <module>
----> 1 fcoef, bcoef = clf.coef_

ValueError: not enough values to unpack (expected 2, got 1)
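The shape of the array makes the problem clear: clf.coef_ is two-dimensional, with a single row and one column per feature, so unpacking it yields one item (the row), not two numbers.

clf.coef_.shape
(1, 2)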

We get rid of the outer brackets by evaluating clf.coef_[0]. Then we use array unpacking to assign the values of fcoef and bcoef.

fcoef, bcoef = clf.coef_[0]
bcoef
1.0840522534754853

For a given flipper length flength, what value of bill length will lead to 80% confidence? We need to solve the following equation:

\[ 0.8 = \frac{1}{1 + e^{-(\text{intercept} + \text{fcoef} \cdot \text{flength} + \text{bcoef} \cdot \text{blength})}} \]

The entire goal of logistic regression is to find the intercept and coefficients so that the right side of the above equation models the probability. In this particular problem, “probability” means the probability that the penguin is a Chinstrap penguin.
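Spelling out the algebra (writing z for the quantity in the exponent, z = intercept + fcoef·flength + bcoef·blength):

\[ 0.8 = \frac{1}{1 + e^{-z}} \implies e^{-z} = \frac{1}{0.8} - 1 \implies z = -\ln\Bigl(\frac{1}{0.8} - 1\Bigr), \]

and solving z = intercept + fcoef·flength + bcoef·blength for the bill length gives

\[ \text{blength} = \frac{1}{\text{bcoef}}\Bigl(-\ln\Bigl(\frac{1}{0.8} - 1\Bigr) - \text{intercept} - \text{fcoef} \cdot \text{flength}\Bigr). \]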

Translating this formula into Python, we get the following function for the bill length.

bill80 = lambda flength: (1/bcoef)*(-ln((1/0.8)-1)-intercept-fcoef*flength)

We get an error because Python does not know to interpret ln as natural log.

bill80(200)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_31032/845188934.py in <module>
----> 1 bill80(200)

/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_31032/2508809668.py in <lambda>(flength)
----> 1 bill80 = lambda flength: (1/bcoef)*(-ln((1/0.8)-1)-intercept-fcoef*flength)

NameError: name 'ln' is not defined

We’ll use NumPy, where natural log is represented by np.log.

import numpy as np
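(In NumPy, np.log is the natural logarithm, base e; np.log10 and np.log2 are the base-10 and base-2 versions.)

np.log(np.e)  # 1.0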

One small issue is that we need to replace intercept by intercept[0]. This is just because scikit-learn returns the intercept not as a number, but as a length-1 array.

bill80 = lambda flength: (1/bcoef)*(-np.log((1/0.8)-1)-intercept[0]-fcoef*flength)

We get a very similar formula for 50% confidence. All we have to do is change the 0.8 to 0.5. (Notice that np.log((1/0.5)-1) = np.log(1) = 0, so the 50% boundary is exactly where intercept + fcoef·flength + bcoef·blength equals zero.)

bill50 = lambda flength: (1/bcoef)*(-np.log((1/0.5)-1)-intercept[0]-fcoef*flength)

For example, the following is saying that if the flipper length is 200mm, then a bill length of 48.55mm leads our model to have 80% confidence that the penguin is a Chinstrap penguin.

bill80(200)
48.54992282607464
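As a sanity check (this step is not in the original lecture), we can feed this point back into the fitted model. Since clf.classes_ is [False, True], the second column of predict_proba is the probability of True, i.e., of being Chinstrap, so we expect a value very close to 0.8.

clf.predict_proba(pd.DataFrame([[200, bill80(200)]], columns=cols))[0, 1]  # approximately 0.8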

We’ll add these 80% and 50% values to our DataFrame. It would be more elegant to have 80 and 50 replaced by a variable, maybe using a dictionary, but this is good enough. (One possible refactor is sketched a little further below.)

df["bdry80"] = df["flipper_length_mm"].map(bill80)
df["bdry50"] = df["flipper_length_mm"].map(bill50)
df.head()
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex is_Chinstrap bdry80 bdry50
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 Male False 42.450199 41.171391
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 Female False 44.055389 42.776582
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 Female False 46.944732 45.665925
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 Female False 46.302656 45.023848
5 Adelie Torgersen 39.3 20.6 190.0 3650.0 Male False 45.339542 44.060734
As a quick check, evaluating bill50 at the first row’s flipper length of 181mm reproduces the bdry50 value shown above.

bill50(181)
41.171391105452194
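Following up on the comment above about elegance: here is one possible refactor (a sketch, not code from the original lecture), where the probability is a parameter instead of being hard-coded.

def bill_boundary(p):
    # Return a function computing the bill length at which our model
    # assigns probability p of the penguin being Chinstrap.
    return lambda flength: (1/bcoef)*(-np.log((1/p)-1) - intercept[0] - fcoef*flength)

for p in (0.5, 0.8):
    df[f"bdry{round(100*p)}"] = df["flipper_length_mm"].map(bill_boundary(p))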
c = alt.Chart(df).mark_circle().encode(
    x=alt.X("flipper_length_mm", scale=alt.Scale(zero=False)),
    y=alt.Y("bill_length_mm", scale=alt.Scale(zero=False)),
    color="species"
)
c80 = alt.Chart(df).mark_line(color="red").encode(
    x=alt.X("flipper_length_mm", scale=alt.Scale(zero=False)),
    y=alt.Y("bdry80", scale=alt.Scale(zero=False)),
)
c50 = alt.Chart(df).mark_line(color="black").encode(
    x=alt.X("flipper_length_mm", scale=alt.Scale(zero=False)),
    y=alt.Y("bdry50", scale=alt.Scale(zero=False)),
)

Above the red line, our model predicts over an 80% chance of being Chinstrap. Between the black line and the red line, our model predicts between a 50% and an 80% chance of being a Chinstrap penguin.

The black line represents what is called the decision boundary: on one side, our model predicts Chinstrap; on the other side, it predicts not Chinstrap. The reason logistic regression is considered a linear model (for example, why it is imported from sklearn.linear_model) is related to the fact that these decision boundaries are linear. (We are finding coefficients of a linear function, even if the eventual output probability is not linear.)

Decision boundaries are also related to overfitting. Because our decision boundary is so simple (just a straight line in two dimensions), we do not need to be too worried about overfitting here. If the decision boundary were overly flexible (like a high-degree polynomial) and were bending around individual Chinstrap data points, that would be a potential sign of overfitting. In the case of logistic regression, the decision boundaries will always be lines (or, in higher dimensions, planes).

c+c80+c50
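To see the decision boundary in action, here is a quick check (not from the original lecture): we classify two points just below and just above the black line at a flipper length of 200mm, and the prediction should flip from False to True.

test = pd.DataFrame({"flipper_length_mm": [200, 200],
                     "bill_length_mm": [bill50(200) - 1, bill50(200) + 1]})
clf.predict(test)  # expected: array([False,  True])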