Worksheet 13
You are encouraged to work in groups of up to 3 total students, but each student should make their own submission on Canvas. (It’s fine for everyone in the group to have the same upload.)
Put the full names of everyone in your group (even if you’re working alone) here. (This makes grading easier.)
Names:
Introduction
Our goal in this worksheet is to predict whether the species of a penguin is Adelie using its bill length as our only input feature.
Import the penguins dataset from Seaborn, drop the rows in which the “bill_length_mm” column contains missing data, and save the resulting DataFrame with the variable name df. (Hint. I don’t see a way to do this using dropna. I recommend using Boolean indexing together with the isna method and negation ~ or else with the notna method. Call the copy method when you perform the Boolean indexing, to prevent warnings below.)
Check your answer. The DataFrame should have 342 rows.
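The Boolean-indexing pattern from the hint can be sketched on a small made-up DataFrame. The name toy and its values are assumptions for illustration only; they are not the penguins data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the penguins data (values made up for illustration).
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Adelie", "Chinstrap"],
    "bill_length_mm": [39.1, np.nan, 40.3, 49.5],
    "sex": ["Male", None, "Female", None],
})

# Keep only the rows whose bill length is present. Calling copy() returns an
# independent DataFrame, so adding columns later does not raise warnings.
toy_clean = toy[toy["bill_length_mm"].notna()].copy()

# Equivalent formulation using isna together with negation.
toy_clean2 = toy[~toy["bill_length_mm"].isna()].copy()

print(toy_clean.shape)
```

Only the “bill_length_mm” column is tested, so a row with missing data in another column (here the last row, whose “sex” is missing) is kept.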
Add a new column to df called “isAdelie_bool” which contains True if the species is “Adelie” and which contains False otherwise.
Add a new column to df called “isAdelie” which contains 1 if the species is “Adelie” and which contains 0 otherwise. (Suggestion. Call the astype method on the “isAdelie_bool” column and specify that the values should be converted to integers.)
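As a minimal sketch of the Boolean-to-integer conversion, here is the same pair of steps on a made-up DataFrame (toy is a hypothetical name, not the worksheet's df).

```python
import pandas as pd

# Toy data, made up for illustration.
toy = pd.DataFrame({"species": ["Adelie", "Gentoo", "Adelie", "Chinstrap"]})

# Comparing a column to a string produces a Boolean Series.
toy["isAdelie_bool"] = toy["species"] == "Adelie"

# astype(int) converts True to 1 and False to 0.
toy["isAdelie"] = toy["isAdelie_bool"].astype(int)

print(toy["isAdelie"].tolist())  # [1, 0, 1, 0]
```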
Using Altair, draw a scatter plot encoding the “bill_length_mm” column in the x-axis channel and encoding “isAdelie” in the y-axis channel. Store the resulting chart with the variable name c.
Why not use linear regression?
Fit a LinearRegression object to this data, using “bill_length_mm” as the only input feature, and using “isAdelie” as the target.
Add a column “pred_lin” to the DataFrame containing the predicted outputs from this linear regression model (the inputs should still be the “bill_length_mm” column).
Make a new Altair chart c2, this time a line chart, using “bill_length_mm” for the x-axis channel and using “pred_lin” for the y-axis channel. Draw the line in the color “red”.
Display c2 layered on top of c from above.
Using logistic regression
Fit a LogisticRegression object to this same data. (Don’t name the object reg. Name it clf instead, for “classifier”.)
Add a new column to df named “pred_log” which contains the predictions corresponding to clf.
Make a new Altair chart c3, this time back to being a scatter plot, using “bill_length_mm” for the x-axis channel and using “pred_log” for the y-axis channel. Draw the points in the color “red”.
Display c3 layered on top of c from above.
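Fitting a classifier follows the same pattern as fitting a regressor. Here is a minimal sketch on made-up data (toy is a hypothetical DataFrame, and LogisticRegression is used with its default settings).

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy data, made up for illustration.
toy = pd.DataFrame({
    "bill_length_mm": [35.0, 37.0, 39.0, 41.0, 45.0, 47.0, 49.0, 51.0],
    "isAdelie": [1, 1, 1, 1, 0, 0, 0, 0],
})

clf = LogisticRegression()
clf.fit(toy[["bill_length_mm"]], toy["isAdelie"])

# predict returns class labels (here 0 or 1), not probabilities.
toy["pred_log"] = clf.predict(toy[["bill_length_mm"]])

print(toy["pred_log"].unique())
```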
Assessing the accuracy
What proportion of these predicted values are correct? We will check this two different ways.
What proportion (as a float between 0 and 1) of the values predicted by logistic regression are correct? You should not use clf to answer this question, just compare the “isAdelie” column to the “pred_log” column.
Caution. If you find yourself using a for-loop or list comprehension or anything like that, you should look for a more succinct approach. Make a Boolean Series and call the mean method.
Now call the score method of clf, passing in the input features along with the true outputs. You should get the same number as above (up to maybe some small numerical precision issues).
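The two ways of measuring accuracy can be compared on made-up data. In this sketch toy is a hypothetical DataFrame whose labels are deliberately not perfectly separated by bill length.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy data, made up for illustration.
toy = pd.DataFrame({
    "bill_length_mm": [35.0, 37.0, 39.0, 41.0, 43.0, 45.0, 47.0, 49.0, 51.0, 53.0],
    "isAdelie": [1, 1, 1, 0, 1, 0, 1, 0, 0, 0],
})

X = toy[["bill_length_mm"]]
clf = LogisticRegression().fit(X, toy["isAdelie"])
toy["pred_log"] = clf.predict(X)

# The mean of a Boolean Series is the proportion of True values.
manual = (toy["isAdelie"] == toy["pred_log"]).mean()

# For a classifier, score reports the mean accuracy on the given data.
from_score = clf.score(X, toy["isAdelie"])

print(manual, from_score)
```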
Probabilities
Call the predict_proba method of clf, using the same input features as above. This will report the predicted probabilities. Name the result arr.
What is the type of the output of predict_proba? What is the shape of the output?
What is the row at index 8 of arr?
This means our model predicts there is about a 26% chance the penguin is something, and about a 74% chance it is not. But which number corresponds to which? Use the classes_ attribute of clf to decide which is which.
Can you calculate one of these values explicitly using the sigmoid function \(1/(1+e^{-x})\)? You will need to use the fit coefficient and intercept from clf along with the input value corresponding to the row at integer location 8 in df.
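The sigmoid calculation can be checked against predict_proba on made-up data. In the sketch below, toy, sigmoid, m, and b are hypothetical names, and the comparison assumes a binary classifier fit with scikit-learn's default settings.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy data, made up for illustration.
toy = pd.DataFrame({
    "bill_length_mm": [35.0, 37.0, 39.0, 41.0, 43.0, 45.0, 47.0, 49.0, 51.0, 53.0],
    "isAdelie": [1, 1, 1, 0, 1, 0, 1, 0, 0, 0],
})

X = toy[["bill_length_mm"]]
clf = LogisticRegression().fit(X, toy["isAdelie"])
arr = clf.predict_proba(X)


def sigmoid(x):
    """The logistic function 1 / (1 + e^(-x))."""
    return 1 / (1 + np.exp(-x))


m = clf.coef_[0, 0]      # coef_ has shape (1, 1) with a single input feature
b = clf.intercept_[0]    # intercept_ has shape (1,)
x8 = toy["bill_length_mm"].iloc[8]

# Probability of the class listed second in clf.classes_ (here the label 1).
p_manual = sigmoid(m * x8 + b)

print(p_manual, arr[8])
```

The other entry of the row is one minus this value, since the two probabilities sum to 1.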
Extract the column of arr corresponding to the probability that the penguin is an Adelie species. Store this in a new column of df named “pred_proba”.
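One way to pick out the right column, sketched on made-up data, is to look up the position of the label 1 in classes_. The names toy and col are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy data, made up for illustration.
toy = pd.DataFrame({
    "bill_length_mm": [35.0, 37.0, 39.0, 41.0, 43.0, 45.0, 47.0, 49.0, 51.0, 53.0],
    "isAdelie": [1, 1, 1, 0, 1, 0, 1, 0, 0, 0],
})

X = toy[["bill_length_mm"]]
clf = LogisticRegression().fit(X, toy["isAdelie"])
arr = clf.predict_proba(X)

# The columns of arr are ordered the same way as clf.classes_.
col = list(clf.classes_).index(1)

# Slice out that column: one probability per row of the DataFrame.
toy["pred_proba"] = arr[:, col]

print(clf.classes_, arr.shape)
```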
Make a new Altair line chart c4, using the color red, and with “bill_length_mm” for the x-axis and with “pred_proba” for the y-axis.
Layer this chart on top of c from above.
If you wanted to locate the penguin from the row at integer location 8 in the above plot, how would you try to do that? Briefly answer in a markdown cell (not in a Python comment).
Briefly explain in your own words, in terms of the above plots, why logistic regression seems more natural to use for this task than linear regression. (Your answer should make reference to something about how these charts look. Using 1-2 sentences is enough.)
Submission
Reminder: everyone needs to make a submission on Canvas.
Reminder: include everyone’s full name at the top, after Names.
Using the Share button at the top right, enable public sharing, and enable Comment privileges. Then submit the created link on Canvas.