Week 5, Thursday Discussion
Reminders:
Homework #4 due Tuesday 11:59pm
Quiz #4 Tuesday during last 20 minutes of discussion
Today:
Work on Homework #4
Try this worksheet, if done with homework already
Data cleaning
We already know a good way to import the Spotify dataset, by using na_values. In this portion of the worksheet, let’s see another way. This will give practice with slicing, with pd.to_numeric, with apply, and with lambda functions.
Import the Spotify dataset as df but do not provide the na_values keyword argument.
Define first_col = "Danceability" and last_col = "Liveness".
Get the sub-DataFrame of df consisting of the columns from first_col to last_col (including last_col). Get this sub-DataFrame using loc and a slice from first_col to last_col. (Hint 1. Make sure you’re slicing the columns axis, not the rows axis. Hint 2. Unlike most slicing in Python, if you slice using loc, the right endpoint is included.) Name the resulting DataFrame df_sub. Apply .copy() so that df_sub is a brand new DataFrame.
Check your answer. The shape of df_sub should be (1556, 6) and all the dtypes in df_sub should be “object”.
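For reference, here is a minimal sketch of label-based column slicing with loc. It uses a small made-up DataFrame called toy (not the Spotify data; the values and the extra Tempo column are invented for illustration), so the shapes will not match the ones in the exercise.

```python
import pandas as pd

# Toy stand-in for the Spotify data; every value here is made up.
toy = pd.DataFrame(
    {
        "Song Name": ["a", "b", "c"],
        "Danceability": ["0.7", "0.5", "missing"],
        "Energy": ["0.8", "missing", "0.6"],
        "Liveness": ["0.1", "0.2", "0.3"],
        "Tempo": ["120", "98", "140"],
    },
    dtype=object,
)

first_col = "Danceability"
last_col = "Liveness"

# The first slice (:) keeps every row; the second slice selects columns by label.
# With loc, the right endpoint (last_col) is included in the result.
toy_sub = toy.loc[:, first_col:last_col].copy()

print(toy_sub.shape)
print(toy_sub.columns.tolist())
```

Because of the .copy(), later changes to toy_sub do not affect toy.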
Get the third column (in the Python numbering) of df_sub using iloc and name the result s (for Series).
Check the type of s. Make sure it’s really a Series.
Try applying the pandas to_numeric function to s. What happens?
Try again, using the errors keyword argument (see the to_numeric documentation) so that the non-numeric strings get converted to not-a-number.
Name the result s2.
Check your answer: the type of s2 should be a Series and the dtype of s2 should be “float64”.
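The behavior of the errors keyword argument can be seen on a tiny made-up Series. This is only a sketch: the string "missing" is a hypothetical placeholder, and the non-numeric strings in the real dataset may look different.

```python
import pandas as pd

# Made-up values; "missing" stands in for whatever non-numeric string the data contains.
s = pd.Series(["0.7", "missing", "0.5"], dtype=object)

# Without the errors argument, the non-numeric string causes an exception.
try:
    pd.to_numeric(s)
except (ValueError, TypeError) as err:
    print("to_numeric failed:", err)

# With errors="coerce", anything that cannot be parsed becomes NaN.
s2 = pd.to_numeric(s, errors="coerce")
print(type(s2))
print(s2.dtype)
```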
Try evaluating df_sub.apply(pd.to_numeric, axis=0). What goes wrong?
Using the same approach and a lambda function, try the same thing where you also coerce non-numeric strings into not-a-number. (In other words, use a mix of this apply strategy with the above errors= strategy.)
Call the result df_sub2.
Check your answer. All the dtypes in df_sub2 should be “float64”.
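One way to combine apply with a lambda function is sketched below on a made-up two-column DataFrame. The name toy_sub and its values are assumptions for illustration only.

```python
import pandas as pd

# Made-up object-dtype columns containing a non-numeric placeholder string.
toy_sub = pd.DataFrame(
    {
        "Danceability": ["0.7", "0.5", "missing"],
        "Liveness": ["0.1", "missing", "0.3"],
    },
    dtype=object,
)

# With axis=0, apply passes each column (a Series) to the function.
# The lambda lets us supply the extra errors="coerce" argument to pd.to_numeric.
toy_sub2 = toy_sub.apply(lambda col: pd.to_numeric(col, errors="coerce"), axis=0)

print(toy_sub2.dtypes)
```

The lambda is needed only because pd.to_numeric has to be called with an extra keyword argument; it is equivalent to defining a short named function and passing that to apply.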
Once df_sub2 is correct, put the values from df_sub2 back into the original DataFrame df. (This should be relatively short. You shouldn’t have to use a for loop or anything.)
Check your answer. Try evaluating df.dtypes["Liveness"]. You should see that this column is “float64”. (Side question. What type of Python object is df.dtypes?)
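As one possible approach (an assumption, not necessarily the intended solution), whole columns can be replaced by assigning a DataFrame to a list of column labels. The sketch below uses made-up DataFrames named toy and toy_sub2.

```python
import pandas as pd

# Made-up original data with object-dtype columns.
toy = pd.DataFrame(
    {
        "Song Name": ["a", "b", "c"],
        "Danceability": ["0.7", "0.5", "missing"],
        "Liveness": ["0.1", "missing", "0.3"],
    },
    dtype=object,
)

# Numeric version of the two columns, as in the previous step.
cols = ["Danceability", "Liveness"]
toy_sub2 = toy[cols].apply(lambda col: pd.to_numeric(col, errors="coerce"), axis=0)

# Assigning to a list of column labels replaces those columns,
# so the new float64 dtype is carried over.
toy[cols] = toy_sub2

print(toy.dtypes["Liveness"])
```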
Drop the rows that contain missing values from df.
Check your answer: the resulting df should be 1545 rows by 23 columns.
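A small sketch of dropping rows with missing values, again on made-up data (the name toy and its values are illustrative assumptions):

```python
import numpy as np
import pandas as pd

# Made-up data in which two of the three rows contain a missing value.
toy = pd.DataFrame(
    {
        "Song Name": ["a", "b", "c"],
        "Energy": [0.8, np.nan, 0.6],
        "Liveness": [0.1, 0.2, np.nan],
    }
)

# axis=0 drops rows (not columns) that contain at least one missing value.
toy_clean = toy.dropna(axis=0)

print(toy_clean.shape)
```

Note that dropna returns a new DataFrame by default, so the result has to be assigned to a variable for the change to persist.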
Clustering
Divide the songs in the Spotify dataset into 5 distinct clusters by using a KMeans object from scikit-learn. Fit the KMeans object using the data from columns first_col:last_col.
Add a “cluster” column to df containing the results of this clustering operation.
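The general pattern for fitting a KMeans object and storing the cluster labels is sketched below on random made-up data. The values of n_init and random_state are choices made for this illustration, not something the worksheet specifies.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Random made-up numeric data standing in for the cleaned Spotify columns.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.random((30, 3)), columns=["Danceability", "Energy", "Liveness"])

first_col = "Danceability"
last_col = "Liveness"

# Instantiate, then fit on the chosen columns only.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
kmeans.fit(toy.loc[:, first_col:last_col])

# predict returns one integer label (0 through 4) per row.
toy["cluster"] = kmeans.predict(toy.loc[:, first_col:last_col])

print(toy["cluster"].value_counts())
```

Without a fixed random_state, the numbering of the clusters can change from run to run.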
Plotting the results
Evaluate the following code. How much of the code can you understand? (Some of it is new, like the bind="legend" and the alt.condition.)
Try clicking on the legend. (For some reason, clicking on clusters 1-4 works fine for me, but clicking on cluster 0 only works for me if I click between the color and the number 0.)
Notice that the tooltip does not seem to work on the bar chart. That is a bug in Altair related to interactivity. One workaround for the bug is to add a meaningless selection object to the second chart. Get the tooltip to work by adding the code .add_selection(alt.selection_single()) right after the transform_filter(sel) part of the code (i.e., add this selection to the definition of the c2 chart).
Which cluster contains the most songs? Which cluster contains the fewest songs? (Warning: If you repeat this experiment, it will probably be a different cluster number which is the smallest.)
import altair as alt

sel = alt.selection_single(fields=["cluster"], bind="legend")

c1 = alt.Chart(df).mark_circle().encode(
    x = "Energy",
    y = "Liveness",
    color = "cluster:N",
    opacity = alt.condition(sel, alt.value(1), alt.value(0.1)),
    tooltip = ["Song Name", "Artist"]
).add_selection(sel)

c2 = alt.Chart(df).mark_bar().encode(
    y = alt.Y("count()", scale=alt.Scale(domain=(0,1600))),
    tooltip = "count()"
).transform_filter(sel)

c1|c2