Worksheet 13
You are encouraged to work in groups of up to 3 total students, but each student should make their own submission on Canvas. (It’s fine for everyone in the group to have the same upload.)
Put the full names of everyone in your group (even if you’re working alone) here. (This makes grading easier.)
Names:
Introduction
Our goal in this worksheet is to predict whether the species of a penguin is Adelie using its bill length as our only input feature.
Import the penguins dataset from Seaborn, drop the rows in which the “bill_length_mm” column contains missing data, and save the result in the variable name df. (Hint. Rather than using dropna, I recommend using Boolean indexing together with the isna method and negation ~, or else with the notna method. Call the copy method when you perform the Boolean indexing, to prevent warnings below.)
Check your answer. The DataFrame should have 342 rows.
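As a rough illustration of the hint above, here is a minimal sketch of Boolean indexing with notna (and, equivalently, isna with negation) followed by copy. It uses a small made-up DataFrame rather than the penguins data; the names toy, kept, kept_alt, "length" and "weight" are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical, not the penguins dataset) with one missing length.
toy = pd.DataFrame({
    "length": [39.1, np.nan, 46.5, 50.0],
    "weight": [3750.0, 3800.0, np.nan, 5000.0],
})

# notna returns a Boolean Series; indexing with it keeps the True rows.
# copy makes the result an independent DataFrame, which avoids
# SettingWithCopyWarning when new columns are assigned later.
kept = toy[toy["length"].notna()].copy()

# The same filter written with isna and negation.
kept_alt = toy[~toy["length"].isna()].copy()

print(len(kept))  # 3
```

Only the "length" column is checked here, so the row with a missing "weight" is kept.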
Add a new column to df called “isAdelie_bool” which contains True if the species is “Adelie” and which contains False otherwise.
Add a new column to df called “isAdelie” which contains 1 if the species is “Adelie” and which contains 0 otherwise. (Suggestion. Call the astype method on the “isAdelie_bool” column and specify that the values should be converted to integers.)
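The pattern in the two steps above (a Boolean column from a comparison, then an integer column via astype) might look like the following on made-up data. The DataFrame toy and its column names are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical column names).
toy = pd.DataFrame({"color": ["red", "blue", "red", "green"]})

# Comparing a column to a value gives a Boolean Series.
toy["is_red_bool"] = toy["color"] == "red"

# astype(int) converts True to 1 and False to 0.
toy["is_red"] = toy["is_red_bool"].astype(int)

print(toy)
```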
Using Altair, draw a scatter plot encoding the “bill_length_mm” column in the x-axis channel and encoding “isAdelie” in the y-axis channel. Store the resulting chart with the variable name c.
Why not use linear regression?
Fit a LinearRegression object to this data, using “bill_length_mm” as the only input feature, and using “isAdelie” as the target.
Add a column “pred_lin” to the DataFrame containing the predicted outputs from this linear regression model (the inputs should still be the “bill_length_mm” column).
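For reference, a minimal sketch of fitting scikit-learn's LinearRegression to a 0/1 target and storing its predictions, using made-up data. The names toy, "x" and "label" are hypothetical; note that the input is passed as a one-column DataFrame (double brackets), since scikit-learn expects two-dimensional input.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data (hypothetical): one numeric feature and a 0/1 target.
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "label": [1, 1, 1, 0, 0, 0],
})

reg = LinearRegression()
# toy[["x"]] is a one-column DataFrame (2D), as scikit-learn requires.
reg.fit(toy[["x"]], toy["label"])

# Predictions are real numbers and are not restricted to the range 0 to 1.
toy["pred_lin"] = reg.predict(toy[["x"]])
print(toy["pred_lin"].min(), toy["pred_lin"].max())
```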
Make a new Altair chart c2, this time a line chart, using “bill_length_mm” for the x-axis channel and using “pred_lin” for the y-axis channel. Draw the line in the color “red”.
Display c2 layered on top of c from above.
Using logistic regression
Fit a LogisticRegression object to this same data. (Don’t name the object reg. Name it clf instead, for “classifier”.)
Add a new column to df named “pred_log” which contains the predictions corresponding to clf.
Make a new Altair chart c3, this time back to being a scatter plot, using “bill_length_mm” for the x-axis channel and using “pred_log” for the y-axis channel. Draw the points in the color “red”.
Display c3 layered on top of c from above.
Assessing the accuracy
What proportion of these predicted values are correct? We will check this two different ways.
What proportion (as a float between 0 and 1) of the values predicted by logistic regression are correct? You should not use clf to answer this question; just compare the “isAdelie” column to the “pred_log” column.
Caution. If you find yourself using a for-loop or list comprehension or anything like that, you should look for a more succinct approach. Make a Boolean Series and call the mean method.
Now call the score method of clf, passing in the input features along with the true outputs. You should get the same number as above (up to maybe some small numerical precision issues).
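The two ways of measuring accuracy described above can be compared on made-up data, as in this sketch. The names toy, "x", "label", "pred", manual and built_in are hypothetical. The mean of a Boolean Series is the proportion of True values, which is what score reports for a classifier.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy data (hypothetical) with some overlap between the two classes.
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "label": [1, 1, 1, 0, 1, 0, 0, 0],
})

clf = LogisticRegression()
clf.fit(toy[["x"]], toy["label"])
toy["pred"] = clf.predict(toy[["x"]])

# Way 1: the mean of a Boolean Series is the proportion of True values.
manual = (toy["label"] == toy["pred"]).mean()

# Way 2: score takes the input features and the true outputs.
built_in = clf.score(toy[["x"]], toy["label"])

print(manual, built_in)
```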
Probabilities
Call the predict_proba method of clf, using the same input features as above. This will report the predicted probabilities. Name the result arr.
What is the type of the output of predict_proba? What is the shape of the output?
What is the row at index 8 of arr?
This means our model predicts there is about a 26% chance the penguin is something, and about a 74% chance it is not. But which number corresponds to which? Use the classes_ attribute of clf to decide which is which.
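To illustrate how the columns of predict_proba line up with classes_, here is a small sketch on made-up data. The names toy, proba, col_for_1 and p_one are hypothetical. Column j of the output holds the predicted probability of the class stored at classes_[j].

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy data (hypothetical): one numeric feature and a 0/1 target.
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "label": [1, 1, 1, 0, 1, 0, 0, 0],
})

clf = LogisticRegression()
clf.fit(toy[["x"]], toy["label"])

# One row per input row, one column per class; each row sums to 1.
proba = clf.predict_proba(toy[["x"]])
print(type(proba), proba.shape)

# classes_ lists the class labels in the same order as the columns.
print(clf.classes_)

# Look up which column corresponds to the class labeled 1.
col_for_1 = list(clf.classes_).index(1)
p_one = proba[:, col_for_1]
```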
Can you calculate one of these values explicitly using the sigmoid function \(1/(1+e^{-x})\)? You will need to use the fit coefficient and intercept from clf along with the input value corresponding to the row at integer location 8 in df.
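The sigmoid calculation can be sketched on made-up data as follows. This assumes a binary target, a single input feature, and default LogisticRegression settings, in which case the predicted probability of the class labeled 1 is the sigmoid of coefficient times input plus intercept. The names toy, i, z, manual and from_model are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy data (hypothetical): one numeric feature and a 0/1 target.
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "label": [1, 1, 1, 0, 1, 0, 0, 0],
})

clf = LogisticRegression()
clf.fit(toy[["x"]], toy["label"])

# Pick one row by integer location and read off its input value.
i = 3
x_i = toy["x"].iloc[i]

# With one feature and two classes, coef_ has shape (1, 1)
# and intercept_ has shape (1,).
z = clf.coef_[0, 0] * x_i + clf.intercept_[0]
manual = 1 / (1 + np.exp(-z))

# Column 1 corresponds to classes_[1], which is the label 1 here.
from_model = clf.predict_proba(toy[["x"]])[i, 1]

print(manual, from_model)
```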
Extract the column of arr corresponding to the probability that the penguin is an Adelie species. Store this in a new column of df named “pred_proba”.
Make a new Altair line chart c4, using the color red, and with “bill_length_mm” for the x-axis and with “pred_proba” for the y-axis.
Layer this chart on top of c from above.
If you wanted to locate the penguin from the row at integer location 8 in the above plot, how would you try to do that? Briefly answer in a markdown cell (not in a Python comment).
Briefly explain in your own words, in terms of the above plots, why logistic regression seems more natural to use for this task than linear regression. (Your answer should make reference to something about how these charts look. Using 1-2 sentences is enough.)
Submission
Reminder: everyone needs to make a submission on Canvas.
Reminder: include everyone’s full name at the top, after Names.
Using the Share button at the top right, enable public sharing, and enable Comment privileges. Then submit the created link on Canvas.