Homework 4

Author: BLANK

Collaborators: BLANK

Introduction

In this homework, we will use K-Means clustering on the “penguins” dataset to divide the data into 3 clusters. The penguins dataset contains 3 different species of penguin. Do different species of penguin appear in different clusters?

Question 1

  • Load the “penguins” dataset from Seaborn. Do not drop any rows yet.

  • Define a list numcols containing the names of the columns which contain numeric values. Create this list using a list comprehension and the pandas function is_numeric_dtype, as described in this Stack Overflow answer.
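As a rough illustration of the list comprehension pattern, here is a minimal sketch on a small made-up DataFrame rather than the penguins data. The names toy and toy_numeric are hypothetical and chosen only for this example.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical toy data, not the penguins dataset.
toy = pd.DataFrame({
    "name": ["a", "b", "c"],
    "height": [1.5, 2.0, 2.5],
    "count": [3, 4, 5],
})

# Keep the name of each column whose dtype is numeric.
toy_numeric = [col for col in toy.columns if is_numeric_dtype(toy[col])]
print(toy_numeric)
```

The string column is skipped because is_numeric_dtype returns False for it.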

Question 2

  • Drop the rows in the DataFrame which are missing values in any of the numeric columns. (Don’t drop any additional rows.)

  • Call the result df. Use .copy() after your definition to make sure df is a brand new DataFrame.

  • Check your answer. This resulting DataFrame should have 342 rows and 7 columns.
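One possible way to drop rows based only on certain columns is the subset argument of the pandas dropna method. The sketch below uses a made-up DataFrame. The names toy and cleaned are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values in different columns.
toy = pd.DataFrame({
    "label": ["a", None, "c", "d"],
    "x": [1.0, 2.0, np.nan, 4.0],
    "y": [10.0, 20.0, 30.0, np.nan],
})

# Drop only the rows that are missing a value in "x" or "y".
# The row with a missing "label" is kept, because "label" is not in subset.
cleaned = toy.dropna(subset=["x", "y"]).copy()
print(cleaned.shape)
```

Calling .copy() makes cleaned an independent DataFrame, so adding a column to it later does not trigger warnings about modifying a slice of toy.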

Question 3

  • Instantiate a new KMeans object from the scikit-learn library, and specify that we want to find 3 different clusters.

  • Using the numeric columns from df and the KMeans object, compute clusters for this dataset, and store them in a new column named “cluster”.

  • Check your answer: the cluster sizes should be approximately 170, 111, 61. (There is some randomness in the clustering algorithm, so your numbers could be different. Another time I got the sizes 165, 107, 70.)
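The general fit-and-label pattern can be sketched as follows, using made-up points with two clusters instead of the penguins data. The DataFrame toy is hypothetical, and the n_init and random_state arguments are choices made for this sketch so that the result is reproducible. They are not required by the assignment.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical toy data: two well-separated groups of three points each.
toy = pd.DataFrame({
    "x": [0.0, 0.1, 0.2, 10.0, 10.1, 10.2],
    "y": [0.0, 0.2, 0.1, 5.0, 5.1, 5.2],
})

# Instantiate the estimator, then fit it and get one label per row.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
toy["cluster"] = kmeans.fit_predict(toy[["x", "y"]])

# Count how many rows landed in each cluster.
print(toy["cluster"].value_counts())
```

The cluster labels themselves (0, 1, ...) are arbitrary, so only the grouping is meaningful, not which group receives which number.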

Question 4

  • Make an Altair scatter plot chart, stored with the variable name c, that contains “flipper_length_mm” for the x-coordinate and “bill_length_mm” for the y-coordinate.

  • Use scale=alt.Scale(zero=False) in both the x-axis and the y-axis so that zero is not included in the axis domains.

  • Look at this chart. Do you have a prediction for what the clusters will be?

Question 5

  • Make a new chart c_cluster from c, which encodes the “cluster” column as the color of the chart. You should not retype any of the definition from c. Instead just use c.encode(color=???) to define c_cluster.

  • Should the “cluster” value be encoded as a quantitative, a nominal, or an ordinal data encoding type? Make this specification when you make your c_cluster chart.

  • Display c_cluster. Does it match what you expected from the clustering?

Question 6

  • Make a new Altair chart, named c_species, which is derived from c this time by encoding the “species” column as the color.

  • Display this new chart.

  • Which chart seems to display more natural clusters?

Question 7

  • Define a new Altair chart c_cluster2 to be a copy of c_cluster. (Altair charts have a copy method, just like pandas DataFrames.)

  • Change the x-axis encoding for c_cluster2 so that it uses “body_mass_g” instead of “flipper_length_mm”.

  • Display c_cluster2.

  • Why do you think the divisions in this clustering look so much more even than the divisions from the original c_cluster?

Question 8

  • Define a new DataFrame df1 which has the same data as df, but in which the numerical data is rescaled so that the columns from numcols have mean 0 and standard deviation 1. Use StandardScaler.

  • Check your work by calling the describe method of df1. You should see a mean of (approximately) 0 and a standard deviation of (approximately) 1 in each column.
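A minimal sketch of rescaling selected columns with StandardScaler, on a made-up DataFrame, might look like the following. The names toy, toy_scaled, and value_cols are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with two numeric columns on very different scales.
toy = pd.DataFrame({
    "label": ["a", "b", "c", "d"],
    "small": [1.0, 2.0, 3.0, 4.0],
    "large": [1000.0, 3000.0, 5000.0, 7000.0],
})
value_cols = ["small", "large"]

# Copy first so the original data is left untouched,
# then overwrite only the numeric columns with their rescaled values.
toy_scaled = toy.copy()
toy_scaled[value_cols] = StandardScaler().fit_transform(toy[value_cols])

print(toy_scaled[value_cols].mean())
print(toy_scaled[value_cols].std(ddof=0))
```

StandardScaler divides by the population standard deviation (ddof=0), while the pandas describe method reports the sample standard deviation (ddof=1). This is one reason the standard deviation shown by describe is only approximately 1.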

Question 9

  • Use a KMeans object to compute three clusters for df1 using numcols, and again store the result in the “cluster” column.

Question 10

  • Define new Altair charts c_cluster_scaled and c_cluster2_scaled to be copies of the c_cluster and c_cluster2 charts from above.

  • You can change the DataFrame used for c_cluster_scaled by using c_cluster_scaled.data = df1. Do this for both c_cluster_scaled and for c_cluster2_scaled.

  • Display the resulting charts.

Question 11

  • Evaluate

for a,b in df.groupby("species"):
    print(a)
    print(b["cluster"].value_counts())

and

for a,b in df1.groupby("species"):
    print(a)
    print(b["cluster"].value_counts())

  • How does this information convey that the df1 clusters do a better job of capturing the divisions into different species?

Submission

To submit this homework, use the Share option at the top right to create a link to the project, then submit that link on Canvas.
