Decision boundary for logistic regression

Announcements

  • I have notecards at the front.

  • Videos and video quizzes due tomorrow (Tuesday). Mostly review of overfitting.

  • Homework due tomorrow, Tuesday 11:59pm.

  • No in-class quiz tomorrow. Yasmeen will go over the sample midterm. (The sample midterm is posted on the Week 8 page on Canvas.)

  • Midterm is Thursday.

  • Only one in-class quiz left, Tuesday of Week 10.

Predicting if a penguin is in the Chinstrap species

The idea of logistic regression is to model the probability that a data point belongs to a certain class.
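Concretely, for input features \(x_1, \dots, x_n\), the model takes the form

\[ P(\text{class}) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n)}} \]

where the intercept \(\beta_0\) and the coefficients \(\beta_1, \dots, \beta_n\) are learned from the data. We will see this same formula below with our fitted intercept and coefficients plugged in.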

  • (Same as Friday last week.) Using the penguins dataset from Seaborn, fit a logistic regression model to classify whether or not a penguin is in the Chinstrap species, using its flipper length and its bill length.

  • Using the model, describe all flipper lengths and bill lengths for which our model thinks there is an 80% chance the penguin is Chinstrap. Give your answer as a formula for bill length in terms of flipper length.

import pandas as pd
import altair as alt
import seaborn as sns
df = sns.load_dataset("penguins").dropna()
cols = ["flipper_length_mm", "bill_length_mm"]
alt.Chart(df).mark_circle().encode(
    x=alt.X("flipper_length_mm", scale=alt.Scale(zero=False)),
    y=alt.Y("bill_length_mm", scale=alt.Scale(zero=False)),
    color="species"
)

The coefficients in logistic regression are easiest to interpret when we are dealing with a binary classification problem (i.e., when there are only two classes). Our two classes will be True (corresponding to “Chinstrap”) and False (corresponding to “not Chinstrap”).

df["is_Chinstrap"] = (df["species"]=="Chinstrap")
df.head()
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex is_Chinstrap
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 Male False
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 Female False
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 Female False
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 Female False
5 Adelie Torgersen 39.3 20.6 190.0 3650.0 Male False

We now perform logistic regression, using only two features from the dataset for our inputs, and using the “is_Chinstrap” column for the target (output).

from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
cols
['flipper_length_mm', 'bill_length_mm']

Recall that logistic regression is used for classification problems. One difference between classification problems and clustering problems is that, in classification problems, we provide sample outputs (also called labels or targets). That is reflected in the fit call below: alongside the inputs, we supply an output y, in this case df["is_Chinstrap"]. When calling the fit method on a clustering object, only inputs are supplied, no outputs; a clustering fit is sketched just below for contrast.
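For contrast, here is what a clustering fit looks like (a sketch using KMeans, which is just an assumed example here and is not used elsewhere in these notes):

from sklearn.cluster import KMeans

# Clustering: fit receives only the inputs; no labels/targets are supplied
KMeans(n_clusters=3).fit(df[cols])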

clf.fit(df[cols], df["is_Chinstrap"])
LogisticRegression()
intercept = clf.intercept_
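As an aside (not in the original lecture), scikit-learn records the class labels it saw during fitting in the classes_ attribute; for our Boolean target the order is [False, True]. This ordering will matter if we later ask for predicted probabilities.

clf.classes_
array([False,  True])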

Here we see the two coefficients, but which coefficient corresponds to which variable?

clf.coef_
array([[-0.34802208,  1.08405225]])

The coefficients are in the same order as the columns we used.

cols
['flipper_length_mm', 'bill_length_mm']
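If we would rather not rely on remembering the order, we can pair each column name with its coefficient explicitly (this line is an addition, not part of the original lecture):

dict(zip(cols, clf.coef_[0]))  # {'flipper_length_mm': -0.348..., 'bill_length_mm': 1.084...}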

So the first coefficient is the flipper length coefficient. The following attempted unpacking does not work; notice the extra set of square brackets on the outside when we displayed clf.coef_ above.

fcoef, bcoef = clf.coef_
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_31032/3736759329.py in <module>
----> 1 fcoef, bcoef = clf.coef_

ValueError: not enough values to unpack (expected 2, got 1)
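The shape of the array makes the problem clear: clf.coef_ is two-dimensional, with a single row and one column per feature, so unpacking it yields one item (the row), not two numbers.

clf.coef_.shape
(1, 2)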

We get rid of the outer brackets by evaluating clf.coef_[0]. Then we use array unpacking to assign the values of fcoef and bcoef.

fcoef, bcoef = clf.coef_[0]
bcoef
1.0840522534754853

For a given flipper length flength, what value of bill length will lead to 80% confidence? We need to solve the following equation:

\[ 0.8 = \frac{1}{1 + e^{-(\text{intercept} + \text{fcoef} \cdot \text{flength} + \text{bcoef} \cdot \text{blength})}} \]

The entire goal of logistic regression is to find the intercept and coefficients so that the right side of the above equation models the probability. In this particular problem, “probability” means the probability that the penguin is a Chinstrap penguin.
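Spelling out the algebra (writing z for the quantity in the exponent, z = intercept + fcoef·flength + bcoef·blength):

\[ 0.8 = \frac{1}{1 + e^{-z}} \implies e^{-z} = \frac{1}{0.8} - 1 \implies z = -\ln\Bigl(\frac{1}{0.8} - 1\Bigr), \]

and solving z = intercept + fcoef·flength + bcoef·blength for the bill length gives

\[ \text{blength} = \frac{1}{\text{bcoef}}\Bigl(-\ln\Bigl(\frac{1}{0.8} - 1\Bigr) - \text{intercept} - \text{fcoef} \cdot \text{flength}\Bigr). \]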

Translating this formula into Python, we get the following function for the bill length.

bill80 = lambda flength: (1/bcoef)*(-ln((1/0.8)-1)-intercept-fcoef*flength)

We get an error because Python does not know to interpret ln as natural log.

bill80(200)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_31032/845188934.py in <module>
----> 1 bill80(200)

/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_31032/2508809668.py in <lambda>(flength)
----> 1 bill80 = lambda flength: (1/bcoef)*(-ln((1/0.8)-1)-intercept-fcoef*flength)

NameError: name 'ln' is not defined

We’ll use NumPy, where natural log is represented by np.log.

import numpy as np
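(In NumPy, np.log is the natural logarithm, base e; np.log10 and np.log2 are the base-10 and base-2 versions.)

np.log(np.e)  # 1.0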

One small issue is that we need to replace intercept by intercept[0]. This is just because scikit-learn returns the intercept not as a number, but as a length-1 array.

bill80 = lambda flength: (1/bcoef)*(-np.log((1/0.8)-1)-intercept[0]-fcoef*flength)

We get a very similar formula for 50% confidence. All we have to do is change the 0.8 to 0.5. (Notice that np.log((1/0.5)-1) = np.log(1) = 0, so the 50% boundary is exactly where intercept + fcoef·flength + bcoef·blength equals zero.)

bill50 = lambda flength: (1/bcoef)*(-np.log((1/0.5)-1)-intercept[0]-fcoef*flength)

For example, the following is saying that if the flipper length is 200mm, then a bill length of 48.55mm leads our model to have 80% confidence that the penguin is a Chinstrap penguin.

bill80(200)
48.54992282607464
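As a sanity check (this step is not in the original lecture), we can feed this point back into the fitted model. Since clf.classes_ is [False, True], the second column of predict_proba is the probability of True, i.e., of being Chinstrap, so we expect a value very close to 0.8.

clf.predict_proba(pd.DataFrame([[200, bill80(200)]], columns=cols))[0, 1]  # approximately 0.8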

We’ll add these 80% and 50% values to our DataFrame. It would be more elegant to have 80 and 50 replaced by a variable, maybe using a dictionary, but this is good enough. (One possible refactor is sketched a little further below.)

df["bdry80"] = df["flipper_length_mm"].map(bill80)
df["bdry50"] = df["flipper_length_mm"].map(bill50)
df.head()
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex is_Chinstrap bdry80 bdry50
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 Male False 42.450199 41.171391
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 Female False 44.055389 42.776582
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 Female False 46.944732 45.665925
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 Female False 46.302656 45.023848
5 Adelie Torgersen 39.3 20.6 190.0 3650.0 Male False 45.339542 44.060734
As a quick check, evaluating bill50 at the first row’s flipper length of 181mm reproduces the bdry50 value shown above.

bill50(181)
41.171391105452194
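Following up on the comment above about elegance: here is one possible refactor (a sketch, not code from the original lecture), where the probability is a parameter instead of being hard-coded.

def bill_boundary(p):
    # Return a function computing the bill length at which our model
    # assigns probability p of the penguin being Chinstrap.
    return lambda flength: (1/bcoef)*(-np.log((1/p)-1) - intercept[0] - fcoef*flength)

for p in (0.5, 0.8):
    df[f"bdry{round(100*p)}"] = df["flipper_length_mm"].map(bill_boundary(p))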
c = alt.Chart(df).mark_circle().encode(
    x=alt.X("flipper_length_mm", scale=alt.Scale(zero=False)),
    y=alt.Y("bill_length_mm", scale=alt.Scale(zero=False)),
    color="species"
)
c80 = alt.Chart(df).mark_line(color="red").encode(
    x=alt.X("flipper_length_mm", scale=alt.Scale(zero=False)),
    y=alt.Y("bdry80", scale=alt.Scale(zero=False)),
)
c50 = alt.Chart(df).mark_line(color="black").encode(
    x=alt.X("flipper_length_mm", scale=alt.Scale(zero=False)),
    y=alt.Y("bdry50", scale=alt.Scale(zero=False)),
)

Above the red line, our model predicts over an 80% chance of being Chinstrap. Between the black line and the red line, our model predicts between a 50% and an 80% chance of being a Chinstrap penguin.

The black line represents what is called the decision boundary: on one side, our model predicts Chinstrap; on the other side, it predicts not Chinstrap. The reason logistic regression is considered a linear model (for example, why it is imported from sklearn.linear_model) is related to the fact that these decision boundaries are linear. (We are finding coefficients of a linear function, even if the eventual output probability is not linear.)

Decision boundaries are also related to overfitting. Because our decision boundary is so simple (just a straight line in two dimensions), we do not need to be too worried about overfitting here. If the decision boundary were overly flexible (like a high-degree polynomial) and were bending around individual Chinstrap data points, that would be a potential sign of overfitting. In the case of logistic regression, the decision boundaries will always be lines (or, in higher dimensions, planes).

c+c80+c50
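To see the decision boundary in action, here is a quick check (not from the original lecture): we classify two points just below and just above the black line at a flipper length of 200mm, and the prediction should flip from False to True.

test = pd.DataFrame({"flipper_length_mm": [200, 200],
                     "bill_length_mm": [bill50(200) - 1, bill50(200) + 1]})
clf.predict(test)  # expected: array([False,  True])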