Week 5, Thursday Discussion

Reminders:

  • Homework #4 due Tuesday 11:59pm

  • Quiz #4 Tuesday during last 20 minutes of discussion

Today:

  • Work on Homework #4

  • Try this worksheet, if done with homework already

Data cleaning

We already know a good way to import the Spotify dataset: using the na_values keyword argument. In this portion of the worksheet, let’s see another way. It gives practice with slicing, pd.to_numeric, apply, and lambda functions.

  • Import the Spotify dataset as df but do not provide the na_values keyword argument.

  • Define first_col = "Danceability" and last_col = "Liveness".

  • Get the sub-DataFrame of df consisting of the columns from first_col to last_col (including last_col). Get this sub-DataFrame using loc and a slice from first_col to last_col. (Hint 1. Make sure you’re slicing the columns axis, not the rows axis. Hint 2. Unlike most slicing in Python, if you slice using loc, the right endpoint is included.) Name the resulting DataFrame df_sub. Apply .copy() so that df_sub is a brand new DataFrame.

  • Check your answer. The shape of df_sub should be (1556, 6) and all the dtypes in df_sub should be “object”.

  • Get the third column (in the Python numbering) of df_sub using iloc and name the result s (for Series).

  • Check the type of s. Make sure it’s really a Series.

  • Try applying the pandas to_numeric function to s. What happens?

  • Try again, using the errors keyword argument (see the pd.to_numeric documentation) so that the non-numeric strings get converted to not-a-number.

  • Name the result s2.

  • Check your answer: s2 should be a Series, and the dtype of s2 should be “float64”.

  • Try evaluating df_sub.apply(pd.to_numeric, axis=0). What goes wrong?

  • Using the same approach and a lambda function, try the same thing where you also coerce non-numeric strings into not-a-number. (In other words, use a mix of this apply strategy with the above errors= strategy.)

  • Call the result df_sub2.

  • Check your answer. All the dtypes in df_sub2 should be “float64”.

  • Once df_sub2 is correct, put the values from df_sub2 back into the original DataFrame df. (This should be relatively short. You shouldn’t have to use a for loop or anything.)

  • Check your answer. Try evaluating df.dtypes["Liveness"]. You should see that this column is “float64”. (Side question. What type of Python object is df.dtypes?)

  • Drop the rows that contain missing values from df.

  • Check your answer: the resulting df should be 1545 rows by 23 columns.
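The steps in this section can be sketched on a tiny made-up DataFrame. This is only one possible approach, not the official solution: the toy data, the " " placeholder for missing values, and the names toy, toy_sub, toy_s2, toy_sub2 and toy_clean are all assumptions made for illustration, and the real Spotify data will have different values and a different shape.

```python
import pandas as pd

# Hypothetical toy data standing in for the Spotify file. The numeric-looking
# columns are stored as strings, and " " marks a missing value (an assumption).
toy = pd.DataFrame({
    "Song Name": ["a", "b", "c"],
    "Danceability": ["0.7", " ", "0.5"],
    "Energy": ["0.8", "0.6", " "],
    "Liveness": ["0.1", "0.2", "0.3"],
    "Chord": ["C", "D", "E"],
})

first_col = "Danceability"
last_col = "Liveness"

# loc slices by label, and the right endpoint is included.
# The leading ":" keeps every row, so the slice applies to the columns axis.
toy_sub = toy.loc[:, first_col:last_col].copy()

# iloc selects by integer position; a single column comes back as a Series.
# errors="coerce" turns strings that cannot be parsed into not-a-number.
toy_s2 = pd.to_numeric(toy_sub.iloc[:, 1], errors="coerce")

# apply calls the function once per column when axis=0.
# The lambda is what lets us pass errors="coerce" along with each column.
toy_sub2 = toy_sub.apply(lambda col: pd.to_numeric(col, errors="coerce"), axis=0)

# Replace the original string columns with the converted ones, matched by name.
toy[toy_sub2.columns] = toy_sub2

# dropna returns a new DataFrame without the rows that contain missing values.
toy_clean = toy.dropna()
```

Assigning with toy[toy_sub2.columns] replaces each column outright, so the new float dtype is kept. Depending on the pandas version, assigning through loc with a full row slice may instead write the values into the existing columns and leave their dtype unchanged, which is why the sketch avoids it.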

Clustering

  • Divide the songs in the Spotify dataset into 5 distinct clusters by using a KMeans object from scikit-learn. Fit the KMeans object using the data from columns first_col:last_col.

  • Add a “cluster” column to df containing the results of this clustering operation.
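As a rough illustration of the fit-then-label pattern, here is a minimal sketch on made-up data. The toy values, the choice of 3 clusters (the worksheet asks for 5), and the n_init and random_state settings are assumptions chosen to keep the example small and repeatable.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical toy data: three well-separated pairs of points.
toy = pd.DataFrame({
    "Energy": [0.10, 0.15, 0.90, 0.95, 0.50, 0.55],
    "Liveness": [0.10, 0.12, 0.90, 0.88, 0.50, 0.52],
})

# Select the feature columns with an inclusive label slice, as in the worksheet.
features = toy.loc[:, "Energy":"Liveness"]

# random_state fixes the random initialization so repeated runs agree.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
kmeans.fit(features)

# predict returns one integer cluster label per row.
toy["cluster"] = kmeans.predict(features)
```

The cluster numbers themselves are arbitrary labels. Without a fixed random_state, the same group of songs can receive a different number on each run.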

Plotting the results

  • Evaluate the following code. How much of the code can you understand? (Some of it is new, like the bind="legend" and the alt.condition.)

  • Try clicking on the legend. (For some reason, clicking on clusters 1-4 works fine for me, but clicking on cluster 0 only works for me if I click between the color and the number 0.)

  • Notice that the tooltip does not seem to work on the bar chart. That is a bug in Altair related to interactivity. One workaround for the bug is to add a meaningless selection object to the second chart. Get the tooltip to work by adding the code .add_selection(alt.selection_single()) right after the transform_filter(sel) part of the code (i.e., add this selection to the definition of the c2 chart).

  • Which cluster contains the most songs? Which cluster contains the fewest songs? (Warning: If you repeat this experiment, it will probably be a different cluster number which is the smallest.)

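import altair as alt

# Single selection keyed on the "cluster" field; clicking a legend entry sets it.
sel = alt.selection_single(fields=["cluster"], bind="legend")

# Scatter plot: points in the selected cluster stay opaque, the rest fade out.
c1 = alt.Chart(df).mark_circle().encode(
    x = "Energy",
    y = "Liveness",
    color = "cluster:N",
    opacity = alt.condition(sel, alt.value(1), alt.value(0.1)),
    tooltip = ["Song Name", "Artist"]
).add_selection(sel)

# Bar chart: counts only the rows that match the current selection.
c2 = alt.Chart(df).mark_bar().encode(
    y = alt.Y("count()", scale=alt.Scale(domain=(0,1600))),
    tooltip = "count()"
).transform_filter(sel)

# Display the two charts side by side.
c1|c2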
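The cluster sizes read off the bar chart can also be checked directly in pandas. The sketch below uses made-up labels in a hypothetical toy DataFrame; with the real data, the same calls would be applied to the “cluster” column of df.

```python
import pandas as pd

# Hypothetical cluster labels, standing in for the "cluster" column of df.
toy = pd.DataFrame({"cluster": [0, 2, 2, 1, 2, 0]})

# value_counts returns a Series: index = cluster label, value = number of rows.
counts = toy["cluster"].value_counts()

# idxmax and idxmin return the index label of the largest and smallest count.
largest = counts.idxmax()
smallest = counts.idxmin()
```

In this toy example, cluster 2 has three rows and cluster 1 has one, so largest is 2 and smallest is 1.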