Worksheet on Logistic Regression

You can download this Jupyter notebook by clicking the download arrow at the top right, and then choosing the .ipynb file.

In this worksheet, we use logistic regression to try to distinguish between two species of iris (the flower). This is one of the classic datasets in machine learning. We will load it from vega_datasets, which you may need to install first by running pip install vega_datasets.

Warm-up example

Here is an example of loading the cars dataset from vega_datasets, and then plotting scatter-plots for all the pairs of numeric columns in that DataFrame.

import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype
import altair as alt
from vega_datasets import data
source = data.cars()
num_cols = [c for c in source.columns if is_numeric_dtype(source[c])]
num_cols
alt.Chart(source).mark_circle().encode(
    alt.X(alt.repeat("column"), type='quantitative'),
    alt.Y(alt.repeat("row"), type='quantitative'),
    color='Origin:N'
).properties(
    width=100,
    height=100
).repeat(
    row=num_cols,
    column=num_cols
)

Question 1

Make the same sort of chart with the iris dataset from vega_datasets. The grid should be 4x4.

Question 2

How many of each species of iris are included in the dataset? Compute this using value_counts.
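As a reminder of how value_counts behaves, here is a minimal sketch on a small hand-made Series. The labels are made up for illustration and are not the iris data; the same call works on a DataFrame column.

```python
import pandas as pd

# Hypothetical toy labels, not taken from the iris dataset
labels = pd.Series(["a", "b", "a", "c", "a", "b"])

# value_counts returns a Series indexed by the distinct values,
# sorted from most frequent to least frequent
counts = labels.value_counts()
print(counts)
```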

Question 3

Get the sub-dataframe, say df_sub, where the iris species is “versicolor” or “virginica” (so the length of the DataFrame should be 100). One nice way to do this is with the isin method, which I don’t think we’ve used before. Probably a more familiar way is to use |, meaning “or”.
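The two approaches mentioned above can be compared on a toy example. This is only a sketch: the DataFrame df below is hypothetical and much smaller than the iris data, but the same pattern should carry over.

```python
import pandas as pd

# Hypothetical toy DataFrame standing in for the iris data
df = pd.DataFrame({
    "species": ["setosa", "versicolor", "virginica", "setosa", "virginica"],
    "petalLength": [1.4, 4.5, 6.0, 1.3, 5.1],
})

# Option 1: isin keeps the rows whose value appears in the given list
sub_isin = df[df["species"].isin(["versicolor", "virginica"])]

# Option 2: combine two Boolean Series with | (the parentheses are required,
# because | binds more tightly than ==)
sub_or = df[(df["species"] == "versicolor") | (df["species"] == "virginica")]

print(len(sub_isin), sub_isin.equals(sub_or))
```

Both options select the same rows; isin tends to be easier to read once there are more than two allowed values.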

Question 4

Fit a LogisticRegression object to this df_sub data, using only the two predictors “sepalLength” and “petalLength” as input variables, and using “species” as the target. You don’t need to do any scaling in this case. (In this case, the units are the same on all the different columns, so it’s at least justifiable to compare them without scaling. That doesn’t mean it would be wrong to scale them.) Find the corresponding classes, coefficients, and intercept. (There should be two classes, two coefficients, and one intercept.)

from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
predictors = ["sepalLength", "petalLength"]
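To see where the classes, coefficients, and intercept live on a fitted object, here is a minimal sketch on synthetic data. The DataFrame toy, its column names, and the labels "neg" and "pos" are all made up for illustration; only the attribute names classes_, coef_, and intercept_ come from scikit-learn.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: two numeric predictors and a two-class string target
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "x1": np.concatenate([rng.normal(0, 1, 20), rng.normal(3, 1, 20)]),
    "x2": np.concatenate([rng.normal(0, 1, 20), rng.normal(3, 1, 20)]),
    "label": ["neg"] * 20 + ["pos"] * 20,
})

toy_clf = LogisticRegression()
toy_clf.fit(toy[["x1", "x2"]], toy["label"])

print(toy_clf.classes_)    # the class labels, in sorted order
print(toy_clf.coef_)       # shape (1, 2): one coefficient per predictor
print(toy_clf.intercept_)  # shape (1,): a single intercept
```

With two classes, the coefficients and intercept describe the log-odds of the second entry of classes_, which is worth keeping in mind when interpreting their signs.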

Question 5

Here is an example of plotting a line on a scatter plot in Altair. The line is specified by just “connecting the dots”, in this case between (4,4) and (7,1).

Make the following changes to the plot.

  1. Change from source to df_sub, so that the “setosa” species goes away.

  2. Change the line so that it corresponds to the “decision boundary”, where our classifier says that both species are equally likely, according to the coefficients and intercept that were found above.

source = data.iris()

point_plot = alt.Chart(source).mark_circle().encode(
    x=alt.X("sepalLength", scale=alt.Scale(zero=False)),
    y=alt.Y("petalLength", scale=alt.Scale(zero=False)),
    color='species'
)

line_df = pd.DataFrame({
    'sepalLength': [4, 7],
    'petalLength': [4, 1],
})

line_plot = alt.Chart(line_df).mark_line(color='red').encode(
    x='sepalLength',
    y='petalLength',
)

point_plot + line_plot
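For item 2 above, it may help to recall that a two-predictor logistic regression model assigns probability sigmoid(w1*x + w2*y + b) to one of the classes, so the two classes are equally likely exactly where w1*x + w2*y + b = 0. The sketch below solves that equation for y. The helper name boundary_y and the numeric values of w1, w2, and b are made up for illustration; they are not the fitted values.

```python
def boundary_y(x, w1, w2, b):
    """Solve w1*x + w2*y + b = 0 for y (assumes w2 is nonzero)."""
    return -(w1 * x + b) / w2

# Made-up coefficients and intercept, for illustration only
w1, w2, b = 1.0, 2.0, -10.0

# Two x-values are enough to "connect the dots" for a straight line
xs = [4, 7]
ys = [boundary_y(x, w1, w2, b) for x in xs]
print(list(zip(xs, ys)))
```

With a fitted classifier, w1 and w2 would presumably come from clf.coef_[0] and b from clf.intercept_[0], and xs and ys would replace the two columns of line_df.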

Question 6

Now find the line where we predict a 90% chance of being in the versicolor species.
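One way to approach this is to invert the sigmoid function: the model predicts probability p exactly where w1*x + w2*y + b equals log(p / (1 - p)). The sketch below uses made-up coefficients and hypothetical helper names (sigmoid, logit, level_y). It also assumes that versicolor is the first entry of clf.classes_, in which case the model's formula gives the probability of virginica, and a 90% chance of versicolor corresponds to p = 0.1. Check clf.classes_ to confirm the order.

```python
import math

def sigmoid(z):
    """The logistic function, mapping log-odds to a probability."""
    return 1 / (1 + math.exp(-z))

def logit(p):
    """Inverse of sigmoid: the log-odds corresponding to probability p."""
    return math.log(p / (1 - p))

def level_y(x, w1, w2, b, p):
    """Solve w1*x + w2*y + b = logit(p) for y (assumes w2 is nonzero)."""
    return (logit(p) - w1 * x - b) / w2

# Made-up coefficients and intercept, for illustration only
w1, w2, b = 1.0, 2.0, -10.0

# Assumption: sigmoid(w1*x + w2*y + b) is the probability of classes_[1],
# so a 90% chance of classes_[0] means p = 0.1 in this formula
p = 0.1
x = 5.0
y = level_y(x, w1, w2, b, p)

# Sanity check: plugging the point back in recovers p (up to rounding)
print(sigmoid(w1 * x + w2 * y + b))
```

Setting p = 0.5 gives logit(p) = 0, which recovers the decision boundary from Question 5.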