Homework 4

Author: BLANK

Collaborators: BLANK

Introduction

In this homework, we will use K-Means clustering on the “penguins” dataset to divide the data into 3 clusters. The penguins dataset contains 3 different species of penguin. Do different species of penguin appear in different clusters?

Question 1

Load the “penguins” dataset from Seaborn. Do not drop any rows yet.
Define a list numcols containing the names of the columns that hold numeric values. Create this list using a list comprehension and the pandas function is_numeric_dtype, as described in this Stack Overflow answer.
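As a hint for the list comprehension, here is a minimal sketch of the is_numeric_dtype pattern. The DataFrame toy and its column names are made up for illustration; they are not the penguins data.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Made-up data for illustration only (not the penguins dataset).
toy = pd.DataFrame({
    "name": ["a", "b", "c"],
    "height": [1.5, 2.0, 2.5],
    "count": [3, 4, 5],
})

# Keep each column name whose column has a numeric dtype.
toy_numcols = [col for col in toy.columns if is_numeric_dtype(toy[col])]
print(toy_numcols)
```

Note that is_numeric_dtype is applied to the column itself (toy[col]), not to the column name.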

Question 2

Drop the rows in the DataFrame which are missing values in any of the numeric columns. (Don’t drop any additional rows.)
Call the result df. Use .copy() after your definition to make sure df is a brand new DataFrame.
Check your answer. This resulting DataFrame should have 342 rows and 7 columns.
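One way to drop rows based only on certain columns is the subset argument of dropna. The sketch below uses a made-up DataFrame (toy, toy_numcols, and toy_clean are illustrative names, not part of the assignment), in which one row is missing a numeric value and another is missing only a non-numeric value.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration only (not the penguins dataset).
toy = pd.DataFrame({
    "label": ["a", None, "c", "d"],
    "height": [1.5, 2.0, np.nan, 2.5],
    "weight": [10.0, 20.0, 30.0, 40.0],
})
toy_numcols = ["height", "weight"]

# Only missing values in the listed columns cause a row to be dropped.
# .copy() makes the result an independent DataFrame.
toy_clean = toy.dropna(subset=toy_numcols).copy()
print(toy_clean.shape)
```

The row with the missing label is kept, because "label" is not in the subset.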

Question 3

Instantiate a new KMeans object from the scikit-learn library, and specify that we want to find 3 different clusters.
Using the numeric columns from df and the KMeans object, compute clusters for this dataset, and store them in a new column named “cluster”.
Check your answer: the cluster sizes should be approximately 170, 111, 61. (There is some randomness in the clustering algorithm, so your numbers could be different. Another time I got the sizes 165, 107, 70.)
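The general instantiate, fit, and predict pattern can be sketched as follows on made-up points that fall into three obvious groups. The names toy and kmeans are illustrative, and the n_init and random_state arguments are choices made here for reproducibility, not requirements of the assignment.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Made-up points forming three well-separated groups (not the penguins dataset).
toy = pd.DataFrame({
    "x": [0.0, 0.1, 0.2, 10.0, 10.1, 10.2, 20.0, 20.1, 20.2],
    "y": [0.0, 0.2, 0.1, 10.0, 10.2, 10.1, 20.0, 20.2, 20.1],
})

# random_state fixes the random initialization so repeated runs agree.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)

# fit_predict fits the model and returns one cluster label per row.
toy["cluster"] = kmeans.fit_predict(toy[["x", "y"]])
print(toy["cluster"].value_counts())
```

The cluster labels themselves (0, 1, 2) are arbitrary; only the grouping is meaningful.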

Question 4

Make an Altair scatter plot chart, stored with the variable name c, that contains “flipper_length_mm” for the x-coordinate and “bill_length_mm” for the y-coordinate.
Use scale=alt.Scale(zero=False) in both the x-axis and the y-axis so that zero is not included in the axis domains.
Look at this chart. Do you have a prediction for what the clusters will be?

Question 5

Make a new chart c_cluster from c, which encodes the “cluster” column as the color of the chart. You should not retype any of the definition from c. Instead just use c.encode(color=???) to define c_cluster.
Should the “cluster” value be encoded as a quantitative, a nominal, or an ordinal data encoding type? Make this specification when you make your c_cluster chart.
Display c_cluster. Does it match what you expected from the clustering?

Question 6

Make a new Altair chart, named c_species, which is derived from c this time by encoding the “species” column as the color.
Display this new chart.
Which chart seems to display more natural clusters?

Question 7

Define a new Altair chart c_cluster2 to be a copy of c_cluster. (Altair charts have a copy method, just like pandas DataFrames.)
Change the x-axis encoding for c_cluster2 so that it uses “body_mass_g” instead of “flipper_length_mm”.
Display c_cluster2.
Why do you think the divisions in this clustering look so much more even than the divisions from the original c_cluster?

Question 8

Define a new DataFrame df1 which has the same data as df, but in which the numerical data is rescaled so that the columns from numcols have mean 0 and standard deviation 1. Use StandardScaler.
Check your work by calling the describe method of df1. You should see a mean of (approximately) 0 and a standard deviation of (approximately) 1 in each column.
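A minimal sketch of rescaling only some columns with StandardScaler, using a made-up DataFrame (toy, toy_numcols, and toy_scaled are illustrative names):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up data for illustration only (not the penguins dataset).
toy = pd.DataFrame({
    "label": ["a", "b", "c", "d"],
    "height": [1.0, 2.0, 3.0, 4.0],
    "weight": [100.0, 300.0, 500.0, 700.0],
})
toy_numcols = ["height", "weight"]

# Work on a copy so the original DataFrame keeps its unscaled values.
toy_scaled = toy.copy()
toy_scaled[toy_numcols] = StandardScaler().fit_transform(toy[toy_numcols])
print(toy_scaled.describe())
```

StandardScaler divides by the population standard deviation (ddof=0), while describe reports the sample standard deviation (ddof=1). This is one reason the std row is only approximately 1, and the gap is larger for a tiny example like this one than for a dataset with hundreds of rows.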

Question 9

Use a KMeans object to compute three clusters for df1 using numcols, and again store the result in the “cluster” column.

Question 10

Define new Altair charts c_cluster_scaled and c_cluster2_scaled to be copies of the c_cluster and c_cluster2 charts from above.
You can change the DataFrame used for c_cluster_scaled by using c_cluster_scaled.data = df1. Do this for both c_cluster_scaled and for c_cluster2_scaled.
Display the resulting charts.

Question 11

Evaluate

for a,b in df.groupby("species"):
    print(a)
    print(b["cluster"].value_counts())

and

for a,b in df1.groupby("species"):
    print(a)
    print(b["cluster"].value_counts())

How does this information convey that the df1 clusters do a better job of capturing the divisions into different species?
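If the printed output of the loops is hard to compare, the same counts can be arranged in a single table with pd.crosstab. This is an optional alternative, sketched on made-up data with a hypothetical "kind" column standing in for "species".

```python
import pandas as pd

# Made-up data for illustration only (not the penguins dataset).
toy = pd.DataFrame({
    "kind": ["a", "a", "a", "b", "b", "b"],
    "cluster": [0, 0, 1, 1, 1, 1],
})

# Rows are kinds, columns are cluster labels, entries are row counts.
table = pd.crosstab(toy["kind"], toy["cluster"])
print(table)
```

Each row of the table shows how one kind is spread across the clusters.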

Submission

To submit this homework, use the Share option at the top right to share the project and create a link, then submit that link on Canvas.


UC Irvine Math 10 S22
