Homework 4
Author: BLANK
Collaborators: BLANK
Introduction¶
In this homework, we will use K-Means clustering on the “penguins” dataset to divide the data into 3 clusters. The penguins dataset contains 3 different species of penguin. Do different species of penguin appear in different clusters?
Question 1¶
Load the “penguins” dataset from Seaborn. Do not drop any rows yet.
Define a list numcols containing the names of the columns which contain numeric values. Create this list using a list comprehension and the pandas function is_numeric_dtype, as described in this Stack Overflow answer.
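As a hint, here is a minimal sketch of the list-comprehension pattern on a small made-up table. The name toy and its values are assumptions for illustration only; they are not the penguins data.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical toy table (not the penguins dataset).
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo"],
    "bill_length_mm": [39.1, 46.1],
    "body_mass_g": [3750.0, 5000.0],
})

# Keep a column name only if that column's dtype is numeric.
numcols = [c for c in toy.columns if is_numeric_dtype(toy[c])]
print(numcols)
```

The same comprehension should carry over unchanged once toy is replaced by the DataFrame loaded from Seaborn.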
Question 2¶
Drop the rows in the DataFrame which are missing values in any of the numeric columns. (Don’t drop any additional rows.)
Call the result df. Use .copy() after your definition to make sure df is a brand new DataFrame.
Check your answer. The resulting DataFrame should have 342 rows and 7 columns.
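One way to restrict the drop to particular columns is the subset argument of dropna. The sketch below uses a made-up table (the names toy and df_toy and all values are assumptions); note that the row missing only a non-numeric value is kept.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table: row 1 is missing numeric values,
# row 3 is missing only the non-numeric "sex" value.
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
    "sex": ["Male", None, "Female", None],
    "bill_length_mm": [39.1, np.nan, 46.1, 50.0],
    "body_mass_g": [3750.0, np.nan, 5000.0, 5700.0],
})
numcols = ["bill_length_mm", "body_mass_g"]

# Drop rows with a missing value in any numeric column, then copy
# so the result is independent of the original DataFrame.
df_toy = toy.dropna(subset=numcols).copy()
print(df_toy.shape)
```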
Question 3¶
Instantiate a new KMeans object from the scikit-learn library, and specify that we want to find 3 different clusters.
Using the numeric columns from df and the KMeans object, compute clusters for this dataset, and store them in a new column named “cluster”.
Check your answer: the cluster sizes should be approximately 170, 111, 61. (There is some randomness in the clustering algorithm, so your numbers could be different. Another time I got the sizes 165, 107, 70.)
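The general pattern can be sketched on synthetic data. Everything below is an assumption for illustration: the blob centers, the column names a and b, and the choices n_init=10 and random_state=0 (the homework does not require either; fixing random_state simply removes the run-to-run randomness mentioned above).

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical data: three well-separated blobs of sizes 5, 4, and 3.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [20.0, 0.0]])
points = np.vstack([
    center + rng.normal(scale=0.5, size=(n, 2))
    for center, n in zip(centers, [5, 4, 3])
])
toy = pd.DataFrame(points, columns=["a", "b"])

# Fit K-Means with 3 clusters and store one label per row.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
toy["cluster"] = kmeans.fit_predict(toy[["a", "b"]])

# Cluster sizes; the label numbers themselves are arbitrary.
sizes = toy["cluster"].value_counts()
print(sizes)
```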
Question 4¶
Make an Altair scatter plot chart, stored with the variable name c, that contains “flipper_length_mm” for the x-coordinate and “bill_length_mm” for the y-coordinate.
Use scale=alt.Scale(zero=False) in both the x-axis and the y-axis so that zero is not included in the axis domains.
Look at this chart. Do you have a prediction for what the clusters will be?
Question 5¶
Make a new chart c_cluster from c, which encodes the “cluster” column as the color of the chart. You should not retype any of the definition from c. Instead just use c.encode(color=???) to define c_cluster.
Should the “cluster” value be encoded as a quantitative, a nominal, or an ordinal data encoding type? Make this specification when you make your c_cluster chart.
Display c_cluster. Does it match what you expected from the clustering?
Question 6¶
Make a new Altair chart, named c_species, which is derived from c, this time by encoding the “species” column as the color.
Display this new chart.
Which chart seems to display more natural clusters?
Question 7¶
Define a new Altair chart c_cluster2 to be a copy of c_cluster. (Altair charts have a copy method, just like pandas DataFrames.)
Change the x-axis encoding for c_cluster2 so that it uses “body_mass_g” instead of “flipper_length_mm”.
Display c_cluster2.
Why do you think the divisions in this clustering look so much more even than the divisions from the original c_cluster?
Question 8¶
Define a new DataFrame df1 which has the same data as df, but in which the numerical data is rescaled so that the columns from numcols have mean 0 and standard deviation 1. Use StandardScaler.
Check your work by calling the describe method of df1. You should see a mean of (approximately) 0 and a standard deviation of (approximately) 1 in each column.
Question 9¶
Use a KMeans object to compute three clusters for df1 using numcols, and again store the result in the “cluster” column.
Question 10¶
Define new Altair charts c_cluster_scaled and c_cluster2_scaled to be copies of the c_cluster and c_cluster2 charts from above.
You can change the DataFrame used for c_cluster_scaled by using c_cluster_scaled.data = df1. Do this for both c_cluster_scaled and for c_cluster2_scaled.
Display the resulting charts.
Question 11¶
Evaluate
    for a,b in df.groupby("species"):
        print(a)
        print(b["cluster"].value_counts())
and
    for a,b in df1.groupby("species"):
        print(a)
        print(b["cluster"].value_counts())
How does this information convey that the df1 clusters do a better job of capturing the divisions into different species?
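If the printed counts are hard to compare, the same information can be viewed as a single table with pd.crosstab. This is an optional alternative, sketched on invented labels; the toy table below is an assumption and does not reflect actual clustering results.

```python
import pandas as pd

# Hypothetical species and cluster labels (made up for illustration).
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap"],
    "cluster": [0, 0, 1, 2, 2, 1],
})

# Rows are species, columns are cluster labels, entries are counts.
table = pd.crosstab(toy["species"], toy["cluster"])
print(table)
```

A species whose row has all of its count in one column lies entirely in a single cluster.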
Submission¶
To submit this homework, open the Share option at the top right, share the project to create a link, and then submit that link on Canvas.