Worksheet 13

You are encouraged to work in groups of up to 3 total students, but each student should make their own submission on Canvas. (It’s fine for everyone in the group to have the same upload.)

Put the full names of everyone in your group (even if you’re working alone) here. (This makes grading easier.)

  • Names:

Introduction

Our goal in this worksheet is to predict whether the species of a penguin is Adelie using its bill length as our only input feature.

  • Import the penguins dataset from Seaborn, drop the rows in which the “bill_length_mm” column contains missing data, and save the resulting DataFrame with the variable name df. (Hint. I don’t see a way to do this using dropna. I recommend using Boolean indexing together with the isna method and negation ~ or else with the notna method. Call the copy method when you perform the Boolean indexing, to prevent warnings below.)

  • Check your answer. The DataFrame should have 342 rows.

  • Add a new column to df called “isAdelie_bool” which contains True if the species is “Adelie” and which contains False otherwise.

  • Add a new column to df called “isAdelie” which contains 1 if the species is “Adelie” and which contains 0 otherwise. (Suggestion. Call the astype method on the “isAdelie_bool” column and specify that the values should be converted to integers.)

  • Using Altair, draw a scatter plot encoding the “bill_length_mm” column in the x-axis channel and encoding “isAdelie” in the y-axis channel. Store the resulting chart with the variable name c.
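
As a rough illustration of the Boolean indexing and astype steps above, here is a minimal sketch on a small made-up DataFrame. The variable toy and its values are hypothetical stand-ins for the penguins data, not the real dataset, so the row counts will not match the 342 mentioned above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the penguins dataset.
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Adelie", "Chinstrap", "Adelie"],
    "bill_length_mm": [39.1, 47.5, np.nan, 49.0, 36.7],
})

# Keep only the rows where bill_length_mm is present.
# Calling copy() makes df an independent DataFrame, which avoids
# warnings when new columns are assigned afterwards.
df = toy[toy["bill_length_mm"].notna()].copy()

# Boolean column: True exactly where the species is "Adelie".
df["isAdelie_bool"] = df["species"] == "Adelie"

# Integer column: True becomes 1 and False becomes 0.
df["isAdelie"] = df["isAdelie_bool"].astype(int)

print(df)
```

The same mask could be written as ~toy["bill_length_mm"].isna(), which is the negation approach mentioned in the hint.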

Why not use linear regression?

  • Fit a LinearRegression object to this data, using “bill_length_mm” as the only input feature, and using “isAdelie” as the target.

  • Add a column “pred_lin” to the DataFrame containing the predicted outputs from this linear regression model (the inputs should still be the “bill_length_mm” column).

  • Make a new Altair chart c2, this time a line chart, using “bill_length_mm” for the x-axis channel and using “pred_lin” for the y-axis channel. Draw the line in the color “red”.

  • Display c2 layered on top of c from above.
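
To get a feel for what the line chart will show, the following sketch fits LinearRegression on a small made-up dataset. The twelve bill lengths and labels are hypothetical, chosen only for illustration. The point is that the predictions are not restricted to the interval from 0 to 1.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: short bills labeled 1 (Adelie), long bills labeled 0.
df = pd.DataFrame({
    "bill_length_mm": [35, 36, 37, 38, 39, 40, 45, 46, 47, 48, 50, 55],
    "isAdelie": [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
})

reg = LinearRegression()
# scikit-learn expects a two-dimensional input, hence the double brackets.
reg.fit(df[["bill_length_mm"]], df["isAdelie"])

df["pred_lin"] = reg.predict(df[["bill_length_mm"]])

# The smallest prediction is below 0 and the largest is above 1,
# so these values cannot be read directly as probabilities.
print(df["pred_lin"].min(), df["pred_lin"].max())
```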

Using logistic regression

  • Fit a LogisticRegression object to this same data. (Don’t name the object reg. Name it clf instead, for “classifier”.)

  • Add a new column to df named “pred_log” which contains the predictions corresponding to clf.

  • Make a new Altair chart c3, this time back to being a scatter plot, using “bill_length_mm” for the x-axis channel and using “pred_log” for the y-axis channel. Draw the points in the color “red”.

  • Display c3 layered on top of c from above.

Assessing the accuracy

What proportion of these predicted values are correct? We will check this two different ways.

  • What proportion (as a float between 0 and 1) of the values predicted by logistic regression are correct? Do not use clf to answer this question; just compare the “isAdelie” column to the “pred_log” column.

Caution. If you find yourself using a for-loop or list comprehension or anything like that, you should look for a more succinct approach. Make a Boolean Series and call the mean method.

  • Now call the score method of clf, and input the input features along with the true outputs. You should get the same number as above (up to maybe some small numerical precision issues).
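
The two approaches above can be sketched as follows, assuming hypothetical toy data with a couple of overlapping labels so that the accuracy is not trivially perfect. The mean of a Boolean Series is the proportion of True values, which is why it agrees with score.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data with some overlap between the two classes.
df = pd.DataFrame({
    "bill_length_mm": [35, 36, 37, 38, 39, 40, 45, 46, 47, 48, 50, 55],
    "isAdelie": [1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0],
})

X = df[["bill_length_mm"]]
y = df["isAdelie"]

clf = LogisticRegression(max_iter=1000)  # max_iter is an assumption, not required
clf.fit(X, y)
df["pred_log"] = clf.predict(X)

# Approach 1: Boolean Series of correct predictions, then its mean.
acc_manual = (df["isAdelie"] == df["pred_log"]).mean()

# Approach 2: the classifier's own score method (mean accuracy).
acc_score = clf.score(X, y)

print(acc_manual, acc_score)
```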

Probabilities

  • Call the predict_proba method of clf, using the same input features as above. This will report the predicted probabilities. Name the result arr.

  • What is the type of the output of predict_proba? What is the shape of the output?

  • What is the row at index 8 of arr?

  • This means our model predicts there is about a 26% chance the penguin is something, and about a 74% chance it is not. But which number corresponds to which? Use the classes_ attribute of clf to decide which is which.

  • Can you calculate one of these values explicitly using the sigmoid function \(1/(1+e^{-x})\)? You will need to use the fit coefficient and intercept from clf along with the input value corresponding to the row at integer location 8 in df.

  • Extract the column of arr corresponding to the probability that the penguin is an Adelie species. Store this in a new column of df named “pred_proba”.

  • Make a new Altair line chart c4, using the color red, and with “bill_length_mm” for the x-axis and with “pred_proba” for the y-axis.

  • Layer this chart on top of c from above.

  • If you wanted to locate the penguin from the row at integer location 8 in the above plot, how would you try to do that? Briefly answer in a markdown cell (not in a Python comment).

  • Briefly explain in your own words, in terms of the above plots, why logistic regression seems more natural to use for this task than linear regression. (Your answer should make reference to something about how these charts look. Using 1-2 sentences is enough.)
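
As a sketch of how the probability steps above connect, the code below uses hypothetical toy data (so the numbers will differ from the 26% and 74% mentioned above). For a binary classifier with coefficient \(w\) and intercept \(b\), the predicted probability of the class labeled 1 at input \(x\) is \(1/(1+e^{-(wx+b)})\), which is what the manual computation checks.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data standing in for the penguins DataFrame.
df = pd.DataFrame({
    "bill_length_mm": [35, 36, 37, 38, 39, 40, 45, 46, 47, 48, 50, 55],
    "isAdelie": [1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0],
})

X = df[["bill_length_mm"]]
clf = LogisticRegression(max_iter=1000)  # max_iter is an assumption, not required
clf.fit(X, df["isAdelie"])

# One row per penguin, one column per class.
arr = clf.predict_proba(X)
print(type(arr), arr.shape)

# classes_ gives the class label for each column of arr, in order.
print(clf.classes_)

# Manual sigmoid for the row at integer location 8.
i = 8
x = df["bill_length_mm"].iloc[i]
z = clf.coef_[0, 0] * x + clf.intercept_[0]
p_manual = 1 / (1 + np.exp(-z))

# Column of arr holding the probability of class 1 (Adelie).
col = list(clf.classes_).index(1)
df["pred_proba"] = arr[:, col]

print(p_manual, arr[i, col])
```

Looking up the column through classes_ rather than hard-coding an index keeps the code correct even if the class labels were ordered differently.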

Submission

  • Reminder: everyone needs to make a submission on Canvas.

  • Reminder: include everyone’s full name at the top, after Names.

  • Using the Share button at the top right, enable public sharing, and enable Comment privileges. Then submit the created link on Canvas.