Worksheet 14


You are encouraged to work in groups of up to 3 total students, but each student should make their own submission on Canvas. (It’s fine for everyone in the group to have the same upload.)

Put the full names of everyone in your group (even if you’re working alone) here. (This makes grading easier.)

Names:

Introduction

Our goal in this worksheet is to use Logistic Regression to predict the artist of a song in the Spotify dataset using the “Loudness” and the “Valence” of that song.

Load the attached Spotify dataset, specifying that missing values correspond to an empty space " " using the na_values keyword argument to read_csv.
Drop the rows with missing values. Name the result df_pre.

Check your answer: there should be 1545 rows, and many of the columns (for example, “Loudness” and “Valence”) should have numeric data types automatically. (You can use the dtypes attribute of df_pre.)
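To see why the na_values argument matters, here is a minimal sketch on a made-up three-row CSV. The column names match the worksheet, but the values and the in-memory "file" are invented for illustration, so do not expect the real dataset to look like this.

```python
import io
import pandas as pd

# Toy CSV text standing in for the real file (assumption: invented values).
# The second row has a single blank space where "Loudness" should be.
csv_text = "Artist,Loudness,Valence\nA,-5.1,0.4\nB, ,0.7\nC,-7.3,0.2\n"

# Without na_values, the " " entry is kept as text, so the column is not numeric.
raw = pd.read_csv(io.StringIO(csv_text))

# With na_values=" ", the blank is read as NaN and the column parses as float.
df_toy = pd.read_csv(io.StringIO(csv_text), na_values=" ").dropna()

print(raw.dtypes)
print(df_toy.dtypes)
print(df_toy.shape)
```

Comparing the two dtypes printouts is a quick way to confirm that the keyword argument had the intended effect.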

The Series method `isin`

Figure out how the following code is working, especially the isin method. (You don’t have to write anything, but you need to use isin in the next part. This is also a good sample quiz or midterm question.)

import pandas as pd

df_temp = pd.DataFrame([[2, "A"], [3, "B"], [4, "A"], [2, "C"], [2, "B"]])
df_temp[df_temp[1].isin(["A", "C"])]

The most frequently occurring artists

Define df to be the sub-DataFrame of df_pre containing only songs by the 7 most common artists in the dataset. (Hint: you can use df_pre["Artist"].value_counts(); the first 7 values in its index are the most common artists. Then use the isin method as above.)

Check your answer: the DataFrame df should have 232 rows and 23 columns.
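The combination of value_counts and isin can be sketched on a small made-up DataFrame. The column name "label" and the values are hypothetical, and the sketch keeps the 2 most common labels rather than 7.

```python
import pandas as pd

# Toy data (assumption: invented labels, not the Spotify artists).
df_toy = pd.DataFrame({"label": ["x", "y", "x", "z", "y", "x", "w"],
                       "value": [1, 2, 3, 4, 5, 6, 7]})

# value_counts sorts from most to least frequent, so the start of its
# index holds the most common labels.
counts = df_toy["label"].value_counts()
top2 = counts.index[:2]

# Boolean indexing with isin keeps only rows whose label is in top2.
df_top = df_toy[df_toy["label"].isin(top2)]
print(list(top2))
print(df_top.shape)
```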

Logistic regression for predicting artists

Define a Logistic Regression classifier clf using scikit-learn, and fit it to df using “Loudness” and “Valence” as the input features and “Artist” as the target.
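As a reminder of the scikit-learn pattern, here is a minimal sketch that fits a classifier to made-up data with three well-separated clusters. The names df_toy, "f1", "f2", and "label" are hypothetical stand-ins, not columns of the Spotify dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy data (assumption): three clusters of 30 points each.
rng = np.random.default_rng(0)
centers = {"a": (0, 0), "b": (4, 0), "c": (0, 4)}
frames = []
for label, center in centers.items():
    pts = rng.normal(loc=center, scale=0.5, size=(30, 2))
    frames.append(pd.DataFrame({"f1": pts[:, 0], "f2": pts[:, 1], "label": label}))
df_toy = pd.concat(frames, ignore_index=True)

# Instantiate, then fit. The input must be two-dimensional, which is why
# the features are selected with a list of column names (double brackets).
clf = LogisticRegression()
clf.fit(df_toy[["f1", "f2"]], df_toy["label"])

print(clf.classes_)
print(clf.score(df_toy[["f1", "f2"]], df_toy["label"]))
```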

One of the seven artists is never predicted. Which one? (Find a programmatic way to find this artist… don’t just look at the artists and write out the answer.)

Using clf.score, what proportion of artist predictions are correct?

How does that number compare to 1/7, which is the proportion we would expect to get correct by random guessing?

How does that number compare to always guessing the most commonly occurring artist?
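One possible way to organize these comparisons is sketched below on made-up, deliberately imbalanced data. All names and values are hypothetical, and whether any class actually goes unpredicted in this toy example is not claimed; the point is only the set difference and the two baselines.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy data (assumption): three overlapping classes of unequal size.
rng = np.random.default_rng(1)
sizes = {"a": 60, "b": 30, "c": 10}
centers = {"a": 0.0, "b": 1.0, "c": 0.5}
frames = []
for label, n in sizes.items():
    pts = rng.normal(loc=centers[label], scale=1.0, size=(n, 2))
    frames.append(pd.DataFrame({"f1": pts[:, 0], "f2": pts[:, 1], "label": label}))
df_toy = pd.concat(frames, ignore_index=True)

X = df_toy[["f1", "f2"]]
y = df_toy["label"]
clf = LogisticRegression().fit(X, y)

# Classes the model knows about, minus classes it ever predicts on X.
never_predicted = set(clf.classes_) - set(clf.predict(X))

accuracy = clf.score(X, y)                  # proportion predicted correctly
uniform_guess = 1 / len(clf.classes_)       # guessing uniformly at random
majority_guess = y.value_counts(normalize=True).iloc[0]  # always the top class

print(never_predicted, accuracy, uniform_guess, majority_guess)
```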

Use the coef_ attribute to answer the following question.

As “Loudness” increases, is the song more or less likely to be a Justin Bieber song, according to our model? What about “Valence”? Answer in a markdown cell (not a Python comment).
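When there are more than two classes, coef_ has one row per class (in the order of classes_) and one column per input feature, so labeling it as a DataFrame makes it easier to read. A minimal sketch on made-up clusters follows; the names "f1", "f2", and "label" are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy data (assumption): class "b" sits at large f1, class "c" at large f2.
rng = np.random.default_rng(0)
centers = {"a": (0, 0), "b": (4, 0), "c": (0, 4)}
frames = []
for label, center in centers.items():
    pts = rng.normal(loc=center, scale=0.5, size=(30, 2))
    frames.append(pd.DataFrame({"f1": pts[:, 0], "f2": pts[:, 1], "label": label}))
df_toy = pd.concat(frames, ignore_index=True)

features = ["f1", "f2"]
clf = LogisticRegression().fit(df_toy[features], df_toy["label"])

# Rows follow clf.classes_, columns follow the order of the input features.
coef_table = pd.DataFrame(clf.coef_, index=clf.classes_, columns=features)
print(coef_table)
```

A positive entry means that increasing that feature raises the model's score for that class, and a negative entry means it lowers it.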

What artist is predicted by our logistic regression classifier for a song with loudness -4 and valence 0.6?

Using the predict_proba method, what is the confidence of our classifier in this prediction? You should use the classes_ attribute to see what entries correspond to what artists.
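The pairing of predict_proba with classes_ can be sketched as follows, again on made-up clusters with hypothetical names. Wrapping the probabilities in a Series indexed by classes_ avoids having to match positions by eye.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy data (assumption): three clusters, not the Spotify dataset.
rng = np.random.default_rng(0)
centers = {"a": (0, 0), "b": (4, 0), "c": (0, 4)}
frames = []
for label, center in centers.items():
    pts = rng.normal(loc=center, scale=0.5, size=(30, 2))
    frames.append(pd.DataFrame({"f1": pts[:, 0], "f2": pts[:, 1], "label": label}))
df_toy = pd.concat(frames, ignore_index=True)
clf = LogisticRegression().fit(df_toy[["f1", "f2"]], df_toy["label"])

# A single new point, given as a one-row DataFrame with the same column names.
new_point = pd.DataFrame({"f1": [3.0], "f2": [1.0]})

# predict_proba returns one row per input; its columns follow clf.classes_.
probs = pd.Series(clf.predict_proba(new_point)[0], index=clf.classes_)
print(probs.sort_values(ascending=False))
print(clf.predict(new_point)[0])
```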

Make an Altair bar chart (use mark_bar) with these seven artist names along the x-axis, and with the heights of the bars given by the probabilities from the predict_proba method (still for this song with loudness -4 and valence 0.6).

What three artists does our classifier think are most likely for this (imaginary) song?

Illustrating the decision boundary

Using default_rng from NumPy, make a 5000x2 NumPy array arr of random real numbers between 0 and 1.
Use one of your student id numbers as a seed when you instantiate the random number generator so you get consistent results.
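Here is a small sketch of seeding a NumPy random number generator. The seed 12345 is a placeholder (not a real student id), and the array is kept tiny so it is easy to inspect.

```python
import numpy as np

seed = 12345  # placeholder seed (assumption), use your own
rng = np.random.default_rng(seed)

# random returns floats in the half-open interval [0, 1).
arr_small = rng.random((4, 2))

# A fresh generator with the same seed reproduces the same numbers.
rng_again = np.random.default_rng(seed)
same = np.array_equal(arr_small, rng_again.random((4, 2)))

print(arr_small.shape, same)
```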

Convert arr to a pandas DataFrame df_art (“art” is for “artificial”), and name the columns “Loudness” and “Valence”.

Rescale the “Loudness” column in df_art so that its values range from approximately -15 to -3. The point of this is to approximately match the values in the original “Loudness” column. (I am just picturing multiplying this column by a value, and then adding or subtracting a constant.)
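The "multiply, then add a constant" idea can be sketched with generic endpoints: if u lies in [0, 1), then lo + (hi - lo) * u lies in [lo, hi). The endpoints 2 and 5 and the column names "u" and "v" below are hypothetical, chosen only for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df_demo = pd.DataFrame(rng.random((1000, 2)), columns=["u", "v"])

# Hypothetical target range for column "u".
lo, hi = 2.0, 5.0

# Stretch [0, 1) to width (hi - lo), then shift so it starts at lo.
df_demo["u"] = lo + (hi - lo) * df_demo["u"]

print(df_demo["u"].min(), df_demo["u"].max())
```

Only the rescaled column changes; the other column still ranges over [0, 1).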

Add a new column called “pred” (for “prediction”) to df_art, corresponding to the predicted artist values, according to our logistic regression classifier.
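Adding a prediction column amounts to calling predict on the feature columns and assigning the result. A minimal sketch on made-up data follows; df_new, "f1", "f2", and "label" are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy training data (assumption): three clusters.
rng = np.random.default_rng(0)
centers = {"a": (0, 0), "b": (4, 0), "c": (0, 4)}
frames = []
for label, center in centers.items():
    pts = rng.normal(loc=center, scale=0.5, size=(30, 2))
    frames.append(pd.DataFrame({"f1": pts[:, 0], "f2": pts[:, 1], "label": label}))
df_toy = pd.concat(frames, ignore_index=True)

features = ["f1", "f2"]
clf = LogisticRegression().fit(df_toy[features], df_toy["label"])

# Artificial points covering roughly the same region as the training data.
df_new = pd.DataFrame(rng.random((200, 2)) * 4, columns=features)

# Pass only the feature columns, in the same order used when fitting.
df_new["pred"] = clf.predict(df_new[features])
print(df_new["pred"].value_counts())
```

Selecting df_new[features] explicitly matters once "pred" exists, since passing the whole DataFrame would then include a non-feature column.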

Make an Altair scatter plot from df_art with “Loudness” along the x-axis, with “Valence” along the y-axis, and with the points colored using the prediction column.

Recall our question above about whether a song is more or less likely to be a Justin Bieber song as the “Valence” increases. Does your answer seem believable in terms of the above chart?

Look for the area of the chart corresponding to our loudness -4 and valence 0.6 values from the previous section. Recall that we found the three most likely artists for those values. Does this chart seem to agree with those three most likely artists? (In other words, does the way this chart looks seem to agree with the three artists we found above using predict_proba?) Answer in a markdown cell (not a Python comment).

In terms of our scatter plot, do you see why the LogisticRegression class lives in the linear_model module of scikit-learn? What seems “linear” about this plot? Use the phrase “decision boundary” in your answer.

Submission

Reminder: everyone needs to make a submission on Canvas.
Reminder: include everyone’s full name at the top, after Names.
Using the Share button at the top right, enable public sharing, and enable Comment privileges. Then submit the created link on Canvas.