Logistic Regression

Announcements

  • I have notecards. Ask if you need one.

  • Videos posted for next week. Mostly review of overfitting. Due Tuesday (not the usual Thursday).

  • Midterm is Thursday next week.

  • A sample midterm is posted on the Week 8 page on Canvas. Yasmeen will go over solutions in discussion section on Tuesday.

  • Only one in-class quiz left, Tuesday of Week 10.

The main categories for Machine Learning

Machine Learning is divided into two broad categories.

  • Unsupervised learning - In unsupervised learning, we do not have labels or “correct answers” for our data. Clustering (such as K-Means clustering) is an example of unsupervised learning.

  • Supervised learning - In supervised learning, the data (or at least some of the data) does include labels. Our goal in supervised learning is to define a model that can correctly output these labels.

Within supervised learning, there are again two major sub-categories.

  • Regression - The target is a numeric value, such as the price of a house. Linear regression and polynomial regression are the examples of regression we have seen.

  • Classification - The target value is one of a discrete set of categories, such as Yes or No (for example, whether a house sold) or a species of penguin. Somewhat confusingly, logistic regression is an example of classification (not of regression).

In class, we briefly discussed why handwritten digit classification should be considered a classification problem, even though the outputs can be considered to be numbers. One intuition for this is to ask yourself, if there were an output like 4.3, how would you want to interpret that? If we are estimating “number of bedrooms” or “years in college”, this kind of output makes perfect sense. But if we are trying to classify a handwritten digit, an answer like 4.3 does not make sense. (A digit 4 looks nothing like a digit 5, so for the purposes of this problem, the ordering of the digits does not matter.)

The sigmoid function

The basic idea of logistic regression (in the case of two classes) is to write a function which outputs, not the class directly, but the probability of a given class. The probability is modeled by the sigmoid function,

\[ \sigma(x) = \frac{e^x}{e^x+1} = \frac{1}{1+e^{-x}} \]
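(The two formulas agree: multiplying the numerator and denominator of the first expression by \(e^{-x}\) gives the second.)

\[ \frac{e^x}{e^x+1} = \frac{e^x \cdot e^{-x}}{(e^x+1)\, e^{-x}} = \frac{1}{1+e^{-x}} \]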

Here is the graph of the sigmoid function:

(Graph of the sigmoid function. Source: Wikimedia Commons.)

If you’d like to read more details about logistic regression, here are some recommended resources.

Penguins example

  • Using the penguins dataset from Seaborn, fit a logistic regression model to classify whether or not a penguin is in the Chinstrap species, using its flipper length and its bill length.

import numpy as np
import pandas as pd
import altair as alt
import seaborn as sns
df = sns.load_dataset("penguins").dropna()
cols = ["flipper_length_mm", "bill_length_mm"]
alt.Chart(df).mark_circle().encode(
    x=alt.X("flipper_length_mm", scale=alt.Scale(zero=False)),
    y=alt.Y("bill_length_mm", scale=alt.Scale(zero=False)),
    color="species"
)
df["species"] == "Chinstrap"
0      False
1      False
2      False
4      False
5      False
       ...  
338    False
340    False
341    False
342    False
343    False
Name: species, Length: 333, dtype: bool
(df["species"] == "Chinstrap").sum()
68

We are performing binary classification in this example, meaning that there are only two classes we are considering, “Chinstrap” and “Other”. We add a Boolean column to the DataFrame with True if the species is “Chinstrap” and with False otherwise.

One advantage to performing binary classification, as opposed to multiclass classification, is that the coefficients found during this fitting process will be easier to interpret.

df["is_Chinstrap"] = (df["species"] == "Chinstrap")
df.head()
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex is_Chinstrap
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 Male False
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 Female False
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 Female False
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 Female False
5 Adelie Torgersen 39.3 20.6 190.0 3650.0 Male False
from sklearn.model_selection import train_test_split
df[cols]
flipper_length_mm bill_length_mm
0 181.0 39.1
1 186.0 39.5
2 195.0 40.3
4 193.0 36.7
5 190.0 39.3
... ... ...
338 214.0 47.2
340 215.0 46.8
341 222.0 50.4
342 212.0 45.2
343 213.0 49.9

333 rows × 2 columns

X_train, X_test, y_train, y_test = train_test_split(df[cols], df["is_Chinstrap"], test_size=0.2, random_state=0)
X_test
flipper_length_mm bill_length_mm
62 185.0 37.6
60 185.0 35.7
283 231.0 54.3
107 190.0 38.2
65 192.0 41.6
... ... ...
122 176.0 40.2
298 215.0 45.2
22 189.0 35.9
151 201.0 41.5
252 207.0 45.1

67 rows × 2 columns
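(Side note: with test_size=0.2 and 333 rows, the test set gets \(\lceil 0.2 \cdot 333 \rceil = \lceil 66.6 \rceil = 67\) rows; scikit-learn rounds the test size up.)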

Notice how the index for y_test perfectly matches the index for X_test.

y_test
62     False
60     False
283    False
107    False
65     False
       ...  
122    False
298    False
22     False
151    False
252    False
Name: is_Chinstrap, Length: 67, dtype: bool
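As a quick sanity check (this cell is an addition, not from the original notebook), we can verify the match programmatically.

# Check that X_test and y_test have identical row labels, in the same order.
(X_test.index == y_test.index).all()
True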

To perform logistic regression using scikit-learn, we will follow the usual pattern of import, instantiate, fit, predict.

from sklearn.linear_model import LogisticRegression

It’s best to use a variable name like clf as opposed to reg to remind ourselves that we are performing classification, not regression.

clf = LogisticRegression()
clf.fit(X_train, y_train)
LogisticRegression()
clf.predict(X_test)
array([False, False, False, False, False, False,  True, False, False,
        True, False, False, False,  True, False, False, False, False,
        True, False, False, False, False, False, False, False, False,
       False, False, False, False,  True, False, False, False, False,
        True, False,  True, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False,  True,
       False, False, False, False])

Let’s determine our accuracy on the test set.

y_test
62     False
60     False
283    False
107    False
65     False
       ...  
122    False
298    False
22     False
151    False
252    False
Name: is_Chinstrap, Length: 67, dtype: bool
# How often were we correct?
(clf.predict(X_test) == y_test).sum()
63
# What proportion were we correct?
(clf.predict(X_test) == y_test).sum()/len(X_test)
0.9402985074626866

There is a quicker way to get this number, using the classifier’s score method.

clf.score(X_test, y_test)
0.9402985074626866
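Equivalently (though we did not use it in class), the accuracy_score function from sklearn.metrics computes this same proportion.

from sklearn.metrics import accuracy_score
accuracy_score(y_test, clf.predict(X_test))
0.9402985074626866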

This accuracy is so high that we don’t have to be too worried about overfitting, but it’s still a good idea to check whether the performance is much better on the training set. Here it is only a little better.

clf.score(X_train, y_train)
0.9624060150375939

(Fun aside that we briefly discussed. Say we are performing binary classification on the MNIST dataset, trying to determine if a handwritten digit is 5 or not. Why is 90% accuracy not impressive in this case?)
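(A milder version of the same issue shows up in our penguin data: only 68 of the 333 penguins are Chinstrap, so a model that always predicts “not Chinstrap” is already right about 80% of the time. Here is a minimal sketch of that baseline, which was not part of the original notebook.)

# Accuracy of always predicting False, i.e., (333-68)/333
(~df["is_Chinstrap"]).mean()
0.7957957957957958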

cols
['flipper_length_mm', 'bill_length_mm']

The “learning” in “machine learning” for this case of logistic regression consists of finding the following three constants: two coefficients and one intercept (also called a bias).

clf.coef_
array([[-0.38263519,  1.17201067]])

Think of the above two coefficients as the coefficients of the values in the columns listed in cols, in the same order.

clf.intercept_
array([20.93321887])
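Putting the three constants together, our fitted model estimates the probability of a penguin being a Chinstrap as

\[ P(\text{Chinstrap}) \approx \sigma\left(-0.3826 \cdot \text{flipper\_length\_mm} + 1.1720 \cdot \text{bill\_length\_mm} + 20.93\right), \]

where \(\sigma\) is the sigmoid function from above. (This is exactly the computation we carry out by hand below.)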

The sigmoid function \(\sigma(x)\) is one of the most natural functions to use if you want to model a probability. (See the graph of sigmoid above.)

sigmoid = lambda x: 1/(1+np.exp(-x))
sigmoid(0)
0.5
sigmoid(10)
0.9999546021312976
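(As one more check, not from the original notebook: large negative inputs give values close to 0, mirroring the behavior for large positive inputs.)

sigmoid(-10)   # approximately 4.54e-05, equal to 1 - sigmoid(10)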
cols
['flipper_length_mm', 'bill_length_mm']
clf.coef_
array([[-0.38263519,  1.17201067]])

Here we use array unpacking to give the two coefficients names. We have to use [0] because clf.coef_ is a two-dimensional array: the array we want is inside another array.

flip_coef, bill_coef = clf.coef_[0]

Here we check that we saved the coefficient correctly.

flip_coef
-0.3826351932141686

What does our model predict if the flipper has length 200mm and the bill has length 50mm?

flip = 200
bill = 50

Here is the value we want to input into the sigmoid function. Because this expression is a linear function (degree 1) of the input features, logistic regression is considered a linear model.

flip_coef*flip + bill_coef*bill + clf.intercept_
array([3.00671389])

Let’s plug that number into the sigmoid.

sigmoid(flip_coef*flip + bill_coef*bill + clf.intercept_)
array([0.95287652])

We can interpret this as saying our model thinks our penguin has a 95.3% chance of being a Chinstrap penguin. Do we get the same result using clf.predict?

The following is a very common mistake. We need to input something 2-dimensional, not something like a one-dimensional list.

clf.predict([flip, bill])
/Users/christopherdavis/miniconda3/envs/math10s22/lib/python3.7/site-packages/sklearn/base.py:451: UserWarning: X does not have valid feature names, but LogisticRegression was fitted with feature names
  "X does not have valid feature names, but"
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_7591/2215702033.py in <module>
----> 1 clf.predict([flip, bill])

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/sklearn/linear_model/_base.py in predict(self, X)
    423             Vector containing the class labels for each sample.
    424         """
--> 425         scores = self.decision_function(X)
    426         if len(scores.shape) == 1:
    427             indices = (scores > 0).astype(int)

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/sklearn/linear_model/_base.py in decision_function(self, X)
    405         check_is_fitted(self)
    406 
--> 407         X = self._validate_data(X, accept_sparse="csr", reset=False)
    408         scores = safe_sparse_dot(X, self.coef_.T, dense_output=True) + self.intercept_
    409         return scores.ravel() if scores.shape[1] == 1 else scores

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/sklearn/base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
    564             raise ValueError("Validation should be done on X, y or both.")
    565         elif not no_val_X and no_val_y:
--> 566             X = check_array(X, **check_params)
    567             out = X
    568         elif no_val_X and not no_val_y:

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
    771                     "Reshape your data either using array.reshape(-1, 1) if "
    772                     "your data has a single feature or array.reshape(1, -1) "
--> 773                     "if it contains a single sample.".format(array)
    774                 )
    775 

ValueError: Expected 2D array, got 1D array instead:
array=[200  50].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

Here we still get a warning (because the classifier was fitted on a DataFrame with column names, but here we are not providing column names), but the computation executes fine.

We get the output of True, which matches our 95.3% confidence, but how do we know our computation was correct?

clf.predict([[flip, bill]])
/Users/christopherdavis/miniconda3/envs/math10s22/lib/python3.7/site-packages/sklearn/base.py:451: UserWarning: X does not have valid feature names, but LogisticRegression was fitted with feature names
  "X does not have valid feature names, but"
array([ True])
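If we also want to avoid the feature-name warning, one option (an aside, not part of the original notebook) is to pass a one-row DataFrame using the same column names the classifier was fitted on.

# Wrap the input in a one-row DataFrame whose columns match the training data.
clf.predict(pd.DataFrame([[flip, bill]], columns=cols))
array([ True])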

We can use predict_proba to see that same 0.95287652 value; it corresponds to the probability of Chinstrap. The other displayed number is the probability of the penguin not being a Chinstrap. (The columns are ordered according to clf.classes_, so False comes before True.)

clf.predict_proba([[flip, bill]])
/Users/christopherdavis/miniconda3/envs/math10s22/lib/python3.7/site-packages/sklearn/base.py:451: UserWarning: X does not have valid feature names, but LogisticRegression was fitted with feature names
  "X does not have valid feature names, but"
array([[0.04712348, 0.95287652]])
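As one final check (not shown in class), scikit-learn’s decision_function method returns the value that gets plugged into the sigmoid, so it should agree with the 3.0067 we computed by hand above.

# This should match flip_coef*flip + bill_coef*bill + clf.intercept_
clf.decision_function([[flip, bill]])
array([3.00671389])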