Homework 4¶
Author: BLANK
Collaborators: BLANK
Introduction¶
In this homework, we will use K-Means clustering on the “penguins” dataset to divide the data into 3 clusters. The penguins dataset contains 3 different species of penguin. Do different species of penguin appear in different clusters?
Question 1¶
Load the “penguins” dataset from Seaborn. Do not drop any rows yet.
Define a list numcols containing the names of the columns which contain numeric values. Create this list using list comprehension and the pandas function is_numeric_dtype, as described in this Stack Overflow answer.
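To illustrate the pattern on something smaller than the penguins data, here is a minimal sketch that picks out numeric column names with a list comprehension. The DataFrame toy, its column names, and the variable numeric_names are made up for this example; is_numeric_dtype is imported from pandas.api.types.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical toy data: one string column and two numeric columns.
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap"],
    "bill_length_mm": [39.1, 46.1, 49.0],
    "year": [2007, 2008, 2009],
})

# Keep the name of every column whose dtype is numeric (int or float).
numeric_names = [col for col in toy.columns if is_numeric_dtype(toy[col])]
print(numeric_names)
```

The comprehension tests each column (a Series), not the column name, which is why it indexes toy[col] inside the condition.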
Question 2¶
Drop the rows in the DataFrame which are missing values in any of the numeric columns. (Don’t drop any additional rows.)
Call the result df. Use .copy() after your definition to make sure df is a brand new DataFrame.
Check your answer. This resulting DataFrame should have 342 rows and 7 columns.
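One way to drop rows based on only some of the columns is the subset argument of dropna. The sketch below uses a made-up DataFrame (toy, num_names, and toy_clean are hypothetical names), so it shows the idea rather than the homework answer.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in both a string column and numeric columns.
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Chinstrap"],
    "sex": ["Male", None, "Female", None],
    "bill_length_mm": [39.1, np.nan, 46.1, 49.0],
    "body_mass_g": [3750.0, 3800.0, np.nan, 3775.0],
})
num_names = ["bill_length_mm", "body_mass_g"]

# Only missing values in the listed columns cause a row to be dropped.
# The last row is kept even though its "sex" entry is missing.
toy_clean = toy.dropna(subset=num_names).copy()
print(toy_clean.shape)
```

Calling .copy() makes toy_clean independent of toy, so later column assignments do not trigger a SettingWithCopyWarning.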
Question 3¶
Instantiate a new KMeans object from the scikit-learn library, and specify that we want to find 3 different clusters.
Using the numeric columns from df and the KMeans object, compute clusters for this dataset, and store them in a new column named “cluster”.
Check your answer: the cluster sizes should be approximately 170, 111, 61. (There is some randomness in the clustering algorithm, so your numbers could be different. Another time I got the sizes 165, 107, 70.)
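The general fit-and-label workflow can be sketched on synthetic data. Everything here is an assumption made for illustration: the three artificial blobs, the column names x and y, and the choices n_init=10 and random_state=0 (fixing random_state is one way to make the cluster labels reproducible, which the homework does not require).

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Three well-separated synthetic blobs of 20 points each (made-up data).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [20.0, 0.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])
toy = pd.DataFrame(points, columns=["x", "y"])

# Ask for 3 clusters; random_state fixes the otherwise random initialization.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)

# fit_predict fits the model and returns one integer label per row.
toy["cluster"] = kmeans.fit_predict(toy[["x", "y"]])
print(toy["cluster"].value_counts())
```

The label numbers themselves (0, 1, 2) are arbitrary, so only the grouping is meaningful, not which group gets which number.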
Question 4¶
Make an Altair scatter plot chart, stored with the variable name c, that contains “flipper_length_mm” for the x-coordinate and “bill_length_mm” for the y-coordinate.
Use scale=alt.Scale(zero=False) in both the x-axis and the y-axis so that zero is not included in the axis domains.
Look at this chart. Do you have a prediction for what the clusters will be?
Question 5¶
Make a new chart c_cluster from c, which encodes the “cluster” column as the color of the chart. You should not retype any of the definition from c. Instead just use c.encode(color=???) to define c_cluster.
Should the “cluster” value be encoded as a quantitative, a nominal, or an ordinal data encoding type? Make this specification when you make your c_cluster chart.
Display c_cluster. Does it match what you expected from the clustering?
Question 6¶
Make a new Altair chart, named c_species, which is derived from c this time by encoding the “species” column as the color.
Display this new chart.
Which chart seems to display more natural clusters?
Question 7¶
Define a new Altair chart c_cluster2 to be a copy of c_cluster. (Altair charts have a copy method, just like pandas DataFrames.)
Change the x-axis encoding for c_cluster2 so that it uses “body_mass_g” instead of “flipper_length_mm”.
Display c_cluster2.
Why do you think the divisions in this clustering look so much more even than the divisions from the original c_cluster?
Question 8¶
Define a new DataFrame df1 which has the same data as df, but in which the numerical data is rescaled so that the columns from numcols have mean 0 and standard deviation 1. Use StandardScaler.
Check your work by calling the describe method of df1. You should see a mean of (approximately) 0 and a standard deviation of (approximately) 1 in each column.
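Here is a minimal sketch of rescaling selected columns with StandardScaler, using a made-up DataFrame (toy, cols, and scaled are hypothetical names). It also shows one reason the standard deviation is only approximately 1: StandardScaler divides by the population standard deviation (ddof=0), while describe reports the sample standard deviation (ddof=1).

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two numeric columns on very different scales.
toy = pd.DataFrame({
    "label": ["a", "b", "c", "d"],
    "length_mm": [180.0, 190.0, 200.0, 210.0],
    "mass_g": [3000.0, 4000.0, 5000.0, 6000.0],
})
cols = ["length_mm", "mass_g"]

# Copy first so the original DataFrame keeps its unscaled values.
scaled = toy.copy()
scaled[cols] = StandardScaler().fit_transform(toy[cols])

# describe() uses ddof=1, so its "std" row comes out slightly above 1.
print(scaled[cols].describe())
```

With only four rows the gap between the two definitions is easy to see, and it shrinks as the number of rows grows.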
Question 9¶
Use a KMeans object to compute three clusters for df1 using numcols, and again store the result in the “cluster” column.
Question 10¶
Define new Altair charts c_cluster_scaled and c_cluster2_scaled to be copies of the c_cluster and c_cluster2 charts from above.
You can change the DataFrame used for c_cluster_scaled by using c_cluster_scaled.data = df1. Do this for both c_cluster_scaled and for c_cluster2_scaled.
Display the resulting charts.
Question 11¶
Evaluate
for a,b in df.groupby("species"):
    print(a)
    print(b["cluster"].value_counts())
and
for a,b in df1.groupby("species"):
    print(a)
    print(b["cluster"].value_counts())
How does this information convey that the df1 clusters do a better job of capturing the divisions into different species?
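The same per-group counts can also be viewed as a single table. As an optional alternative to the loops above (not something the homework asks for), this sketch uses pd.crosstab on a made-up DataFrame whose names toy, kind, and table are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: a known grouping and a computed cluster label.
toy = pd.DataFrame({
    "kind": ["a", "a", "a", "b", "b", "b"],
    "cluster": [0, 0, 1, 1, 1, 1],
})

# Rows are the known groups, columns are cluster labels, entries are counts.
table = pd.crosstab(toy["kind"], toy["cluster"])
print(table)
```

A row with one large entry and zeros elsewhere means that group landed almost entirely in one cluster.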
Submission¶
To submit this homework, go to the Share option at the top right, share the project to create a link, and then submit that link on Canvas.