Logistic Regression¶
Announcements¶
I have notecards. Ask if you need one.
Videos posted for next week. Mostly review of overfitting. Due Tuesday (not the usual Thursday).
Midterm is Thursday next week.
A sample midterm is posted on the Week 8 page on Canvas. Yasmeen will go over solutions in discussion section on Tuesday.
Only one in-class quiz left, Tuesday of Week 10.
The main categories for Machine Learning¶
Machine Learning is divided into two broad categories.
Unsupervised learning - In unsupervised learning, we do not have labels or “correct answers” for our data. Clustering (such as K-Means clustering) is an example of unsupervised learning.
Supervised learning - In supervised learning, the data (or at least some of the data) does include labels. Our goal in supervised learning is to define a model that can correctly output these labels.
Within supervised learning, there are again two major sub-categories.
Regression - The target is a numeric value, such as the price of a house. Linear regression and polynomial regression are the examples of regression we have seen so far.
Classification - The target value is one out of a discrete set of categories, such as Yes or No, whether a house sold, or a species of penguin. Somewhat confusingly, logistic regression is an example of classification (not of regression).
In class, we briefly discussed why handwritten digit classification should be considered a classification problem, even though the outputs can be considered to be numbers. One intuition for this is to ask yourself, if there were an output like 4.3, how would you want to interpret that? If we are estimating “number of bedrooms” or “years in college”, this kind of output makes perfect sense. But if we are trying to classify a handwritten digit, an answer like 4.3 does not make sense. (A digit 4 looks nothing like a digit 5, so for the purposes of this problem, the ordering of the digits does not matter.)
The sigmoid function¶
The basic idea of logistic regression (in the case of two classes) is to write a function which outputs, not the class directly, but the probability of that class. The probability is modeled by the sigmoid function, \(\sigma(x) = \dfrac{1}{1+e^{-x}}\).
Here is the graph of the sigmoid function:
Source: Wikimedia commons
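If you want to produce a graph like this yourself, here is a short sketch (not from the original lecture) that draws the sigmoid curve using Altair, with the same libraries we import below for the penguins example.
import numpy as np
import pandas as pd
import altair as alt
# Evaluate the sigmoid function on a grid of x-values.
x = np.linspace(-10, 10, 201)
df_sigmoid = pd.DataFrame({"x": x, "sigmoid": 1/(1 + np.exp(-x))})
# Plot the sigmoid curve.
alt.Chart(df_sigmoid).mark_line().encode(
    x="x",
    y="sigmoid"
)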
If you’d like to read more details about logistic regression, here are some recommended resources.
Recommended YouTube video by codebasics.
Wikipedia article. The example on the probability of passing an exam is especially helpful.
Chapter 4 of Hands on Machine Learning. (Within Chapter 4, scroll down to the section on Logistic Regression.)
Penguins example¶
Using the penguins dataset from Seaborn, fit a logistic regression model to classify whether or not a penguin is in the Chinstrap species, using its flipper length and its bill length.
import numpy as np
import pandas as pd
import altair as alt
import seaborn as sns
df = sns.load_dataset("penguins").dropna()
cols = ["flipper_length_mm", "bill_length_mm"]
alt.Chart(df).mark_circle().encode(
x=alt.X("flipper_length_mm", scale=alt.Scale(zero=False)),
y=alt.Y("bill_length_mm", scale=alt.Scale(zero=False)),
color="species"
)
df["species"] == "Chinstrap"
0 False
1 False
2 False
4 False
5 False
...
338 False
340 False
341 False
342 False
343 False
Name: species, Length: 333, dtype: bool
(df["species"] == "Chinstrap").sum()
68
We are performing binary classification in this example, meaning that there are only two classes we are considering, “Chinstrap” and “Other”. We add a Boolean column to the DataFrame with True if the species is “Chinstrap” and with False otherwise.
One advantage to performing binary classification, as opposed to multiclass classification, is that the coefficients found during this fitting process will be easier to interpret.
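For comparison (this is a sketch, not something we ran in lecture), if we instead fit a logistic regression on all three species, the fitted clf.coef_ contains one row of coefficients per class, so there is no longer a single coefficient per feature to interpret.
from sklearn.linear_model import LogisticRegression
# Hypothetical multiclass fit on all three species, to contrast with the
# binary case below.  max_iter is raised to help the solver converge.
clf_multi = LogisticRegression(max_iter=1000)
clf_multi.fit(df[cols], df["species"])
clf_multi.coef_.shape  # one row of coefficients for each of the three species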
df["is_Chinstrap"] = (df["species"] == "Chinstrap")
df.head()
| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | is_Chinstrap |
|---|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male | False |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female | False |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female | False |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female | False |
| 5 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | Male | False |
from sklearn.model_selection import train_test_split
df[cols]
| | flipper_length_mm | bill_length_mm |
|---|---|---|
| 0 | 181.0 | 39.1 |
| 1 | 186.0 | 39.5 |
| 2 | 195.0 | 40.3 |
| 4 | 193.0 | 36.7 |
| 5 | 190.0 | 39.3 |
| ... | ... | ... |
| 338 | 214.0 | 47.2 |
| 340 | 215.0 | 46.8 |
| 341 | 222.0 | 50.4 |
| 342 | 212.0 | 45.2 |
| 343 | 213.0 | 49.9 |
333 rows × 2 columns
X_train, X_test, y_train, y_test = train_test_split(df[cols], df["is_Chinstrap"], test_size=0.2, random_state=0)
X_test
| | flipper_length_mm | bill_length_mm |
|---|---|---|
| 62 | 185.0 | 37.6 |
| 60 | 185.0 | 35.7 |
| 283 | 231.0 | 54.3 |
| 107 | 190.0 | 38.2 |
| 65 | 192.0 | 41.6 |
| ... | ... | ... |
| 122 | 176.0 | 40.2 |
| 298 | 215.0 | 45.2 |
| 22 | 189.0 | 35.9 |
| 151 | 201.0 | 41.5 |
| 252 | 207.0 | 45.1 |
67 rows × 2 columns
Notice how the index for y_test perfectly matches the index for X_test.
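If you want to check this directly, one option (an extra check, not in the original code) is to compare the two indices:
# The rows produced by train_test_split keep their original DataFrame index,
# so the features and labels stay aligned.  This should return True.
X_test.index.equals(y_test.index)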
y_test
62 False
60 False
283 False
107 False
65 False
...
122 False
298 False
22 False
151 False
252 False
Name: is_Chinstrap, Length: 67, dtype: bool
To perform logistic regression using scikit-learn, we will follow the usual pattern of import, instantiate, fit, predict.
from sklearn.linear_model import LogisticRegression
It’s best to use a variable name like clf as opposed to reg to remind ourselves that we are performing classification, not regression.
clf = LogisticRegression()
clf.fit(X_train, y_train)
LogisticRegression()
clf.predict(X_test)
array([False, False, False, False, False, False, True, False, False,
True, False, False, False, True, False, False, False, False,
True, False, False, False, False, False, False, False, False,
False, False, False, False, True, False, False, False, False,
True, False, True, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, True,
False, False, False, False])
Let’s determine what percentage of accuracy we had on the test set.
y_test
62 False
60 False
283 False
107 False
65 False
...
122 False
298 False
22 False
151 False
252 False
Name: is_Chinstrap, Length: 67, dtype: bool
# How often were we correct?
(clf.predict(X_test) == y_test).sum()
63
# What proportion were we correct?
(clf.predict(X_test) == y_test).sum()/len(X_test)
0.9402985074626866
There is a quicker way to get this number, using the classifier’s score method.
clf.score(X_test, y_test)
0.9402985074626866
This accuracy is high enough that we don’t need to be too worried about overfitting, but it is still a good idea to check whether the performance is much better on the training set. Here it is only a little better.
clf.score(X_train, y_train)
0.9624060150375939
(Fun aside that we briefly discussed. Say we are performing binary classification on the MNIST dataset, trying to determine if a handwritten digit is 5 or not. Why is 90% accuracy not impressive in this case?)
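One way to make this concrete (a sketch, not part of the original code) is to measure the accuracy of a “classifier” that always predicts the most common class. Only 68 of our 333 penguins are Chinstrap, so always guessing False already scores roughly 80% here; for MNIST, where only about 10% of the digits are 5s, the same trick scores about 90%.
# Accuracy on the test set of always predicting "not Chinstrap".
# No learning is involved, yet the score should already be around 0.8.
(~y_test).mean()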
cols
['flipper_length_mm', 'bill_length_mm']
In this case of logistic regression, the “learning” in “machine learning” consists of finding the following three constants: two coefficients and one intercept (also called a bias).
clf.coef_
array([[-0.38263519, 1.17201067]])
Think of the above 2 coefficients as being coefficients of the values in the columns listed in cols.
clf.intercept_
array([20.93321887])
The sigmoid function \(\sigma(x)\) is one of the most natural functions to use if you want to model a probability. (See the graph of sigmoid above.)
sigmoid = lambda x: 1/(1+np.exp(-x))
sigmoid(0)
0.5
sigmoid(10)
0.9999546021312976
cols
['flipper_length_mm', 'bill_length_mm']
clf.coef_
array([[-0.38263519, 1.17201067]])
Here we do array unpacking. We have to use [0] because the array we want is inside another array.
flip_coef, bill_coef = clf.coef_[0]
Here we check that we saved the coefficient correctly.
flip_coef
-0.3826351932141686
What does our model predict if the flipper has length 200mm and the bill has length 50mm?
flip = 200
bill = 50
Here is the value we want to input into the sigmoid function. Because this expression is a linear function (degree 1) of the inputs, logistic regression is considered a linear model.
flip_coef*flip + bill_coef*bill + clf.intercept_
array([3.00671389])
Let’s plug that number into the sigmoid.
sigmoid(flip_coef*flip + bill_coef*bill + clf.intercept_)
array([0.95287652])
We can interpret this as saying our model thinks our penguin has a 95.3% chance of being a Chinstrap penguin. Do we get the same result using clf.predict?
The following is a very common mistake. We need to input something 2-dimensional, not something like a one-dimensional list.
clf.predict([flip, bill])
/Users/christopherdavis/miniconda3/envs/math10s22/lib/python3.7/site-packages/sklearn/base.py:451: UserWarning: X does not have valid feature names, but LogisticRegression was fitted with feature names
"X does not have valid feature names, but"
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_7591/2215702033.py in <module>
----> 1 clf.predict([flip, bill])
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/sklearn/linear_model/_base.py in predict(self, X)
423 Vector containing the class labels for each sample.
424 """
--> 425 scores = self.decision_function(X)
426 if len(scores.shape) == 1:
427 indices = (scores > 0).astype(int)
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/sklearn/linear_model/_base.py in decision_function(self, X)
405 check_is_fitted(self)
406
--> 407 X = self._validate_data(X, accept_sparse="csr", reset=False)
408 scores = safe_sparse_dot(X, self.coef_.T, dense_output=True) + self.intercept_
409 return scores.ravel() if scores.shape[1] == 1 else scores
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/sklearn/base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
564 raise ValueError("Validation should be done on X, y or both.")
565 elif not no_val_X and no_val_y:
--> 566 X = check_array(X, **check_params)
567 out = X
568 elif no_val_X and not no_val_y:
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
771 "Reshape your data either using array.reshape(-1, 1) if "
772 "your data has a single feature or array.reshape(1, -1) "
--> 773 "if it contains a single sample.".format(array)
774 )
775
ValueError: Expected 2D array, got 1D array instead:
array=[200 50].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
Here we still get a warning (because our original DataFrame had column names but here we are not using column names), but the computation executes fine.
We get the output of True, which matches our 95.3% confidence, but how do we know our computation was correct?
clf.predict([[flip, bill]])
/Users/christopherdavis/miniconda3/envs/math10s22/lib/python3.7/site-packages/sklearn/base.py:451: UserWarning: X does not have valid feature names, but LogisticRegression was fitted with feature names
"X does not have valid feature names, but"
array([ True])
We can use predict_proba to see that same 0.95287652 value. It corresponds to the probability of Chinstrap. The other displayed number is the probability of the penguin not being a Chinstrap.
clf.predict_proba([[flip, bill]])
/Users/christopherdavis/miniconda3/envs/math10s22/lib/python3.7/site-packages/sklearn/base.py:451: UserWarning: X does not have valid feature names, but LogisticRegression was fitted with feature names
"X does not have valid feature names, but"
array([[0.04712348, 0.95287652]])
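As one final check (a sketch, not from the original lecture), we can pass a one-row DataFrame with the same column names to avoid the feature-name warning, and confirm that applying our sigmoid to clf.decision_function, which computes the same linear combination we wrote out by hand, reproduces the 0.95287652 probability.
# Using a DataFrame with the fitted column names avoids the warning above.
X_new = pd.DataFrame([[flip, bill]], columns=cols)
# decision_function returns coef . x + intercept; applying sigmoid to that
# value should match the Chinstrap column of predict_proba.
sigmoid(clf.decision_function(X_new))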