Worksheet 14
You are encouraged to work in groups of up to 3 total students, but each student should make their own submission on Canvas. (It’s fine for everyone in the group to have the same upload.)
Put the full names of everyone in your group (even if you’re working alone) here. (This makes grading easier.)
Names:
Introduction
Our goal in this worksheet is to use Logistic Regression to predict the artist of a song in the Spotify dataset using the “Loudness” and the “Valence” of that song.
Load the attached Spotify dataset, specifying that missing values correspond to an empty space " " using the na_values keyword argument to read_csv.
Drop the rows with missing values. Name the result df_pre.
Check your answer: there should be 1545 rows, and many of the columns (for example, “Loudness” and “Valence”) should have numeric data types automatically. (You can use the dtypes attribute of df_pre.)
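As a minimal sketch of how na_values behaves, the example below reads a small made-up CSV from a string instead of the real file. The data, the variable names csv_text, df_raw and df_clean, and the use of io.StringIO are all assumptions for illustration only.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real file: the " " entries mark missing values.
csv_text = (
    "Artist,Loudness,Valence\n"
    "A,-5.1,0.62\n"
    "B, ,0.40\n"
    "C,-7.3, \n"
    "D,-4.0,0.81\n"
)

# Treat a single space as a missing value while reading.
df_raw = pd.read_csv(io.StringIO(csv_text), na_values=" ")

# Drop every row that contains at least one missing value.
df_clean = df_raw.dropna()

# With the spaces read as missing, the numeric columns get numeric dtypes.
print(df_clean.dtypes)
```

Without na_values=" ", the columns containing a space would typically be read as strings rather than numbers.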
The Series method isin
Figure out how the following code is working, especially the isin method. (You don’t have to write anything, but you need to use isin in the next part. This is also a good sample quiz or midterm question.)
import pandas as pd

df_temp = pd.DataFrame([[2, "A"], [3, "B"], [4, "A"], [2, "C"], [2, "B"]])
df_temp[df_temp[1].isin(["A", "C"])]
The most frequently occurring artists
Define df to be the sub-DataFrame of df_pre containing only songs by the 7 most common artists in the dataset. (Hints. You can use df_pre["Artist"].value_counts(), and then the first 7 values in the index will be the most common artists. Now use the isin method as above.)
Check your answer: the DataFrame df should have 232 rows and 23 columns.
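The hint above can be sketched on a tiny made-up DataFrame. Here toy, counts, top2 and toy_top are hypothetical names, and the example keeps the 2 most common values rather than 7.

```python
import pandas as pd

# Hypothetical toy data; the real worksheet uses df_pre instead.
toy = pd.DataFrame({
    "Artist": ["X", "Y", "X", "Z", "Y", "X", "W", "Y"],
    "Loudness": [-5, -6, -4, -9, -7, -5, -8, -6],
})

# value_counts sorts from most frequent to least frequent,
# so the start of its index holds the most common values.
counts = toy["Artist"].value_counts()
top2 = counts.index[:2]

# isin builds a Boolean Series, which is then used to select rows.
toy_top = toy[toy["Artist"].isin(top2)]
print(toy_top)
```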
Logistic regression for predicting artists
Define a Logistic Regression classifier clf using scikit-learn, and fit it to df using “Loudness” and “Valence” for the input features, and using “Artist” for the target.
One of the seven artists is never predicted. Which one? (Find a programmatic way to find this artist… don’t just look at the artists and write out the answer.)
Using clf.score, what proportion of artist predictions are correct?
How does that number compare to 1/7, which is the proportion we would expect to get correct by random guessing?
How does that number compare to always guessing the most commonly occurring artist?
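One possible way to organize these comparisons is sketched below on randomly generated toy data with three made-up artists. The data, the names toy, features, never_predicted, acc and baseline, and the choice max_iter=1000 are assumptions, not part of the worksheet.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: three made-up artists with different typical loudness.
rng = np.random.default_rng(0)
sizes = {"Artist A": 30, "Artist B": 20, "Artist C": 10}
centers = {"Artist A": -12.0, "Artist B": -8.0, "Artist C": -4.0}
frames = []
for name, n in sizes.items():
    frames.append(pd.DataFrame({
        "Loudness": rng.normal(centers[name], 0.5, size=n),
        "Valence": rng.random(size=n),
        "Artist": name,
    }))
toy = pd.concat(frames, ignore_index=True)

features = ["Loudness", "Valence"]
clf = LogisticRegression(max_iter=1000)
clf.fit(toy[features], toy["Artist"])

# Classes the model knows about but never outputs on the training data.
predicted = set(clf.predict(toy[features]))
never_predicted = set(clf.classes_) - predicted

# Proportion of correct predictions on the training data.
acc = clf.score(toy[features], toy["Artist"])

# Proportion correct if we always guessed the most common class.
baseline = toy["Artist"].value_counts(normalize=True).iloc[0]

# Proportion expected from uniform random guessing.
random_guess = 1 / toy["Artist"].nunique()
print(never_predicted, acc, baseline, random_guess)
```

On this toy data the set of never-predicted classes may well be empty; the set difference is only meant to show the pattern.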
Use the coef_ attribute to answer the following question.
As “Loudness” increases, is the song more or less likely to be a Justin Bieber song, according to our model? What about “Valence”? Answer in a markdown cell (not a Python comment.)
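To read coef_ it helps to know which row belongs to which class. The sketch below, on a small made-up dataset, labels the rows with classes_ and the columns with the feature names; coef_table and the toy data are hypothetical. With three or more classes, coef_ has one row per class, in the same order as classes_.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data with three made-up artists.
toy = pd.DataFrame({
    "Loudness": [-12, -11, -13, -8, -7, -9, -4, -3, -5],
    "Valence": [0.2, 0.3, 0.1, 0.5, 0.6, 0.4, 0.9, 0.8, 0.7],
    "Artist": ["A"] * 3 + ["B"] * 3 + ["C"] * 3,
})
features = ["Loudness", "Valence"]
clf = LogisticRegression(max_iter=1000).fit(toy[features], toy["Artist"])

# One row per class (ordered like clf.classes_), one column per feature.
coef_table = pd.DataFrame(clf.coef_, index=clf.classes_, columns=features)
print(coef_table)
```

A positive entry suggests that increasing that feature raises the model's score for that class relative to the others, and a negative entry suggests the opposite.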
What artist is predicted by our logistic regression classifier for a song with loudness -4 and valence 0.6?
Using the predict_proba method, what is the confidence of our classifier in this prediction? You should use the classes_ attribute to see what entries correspond to what artists.
Make an Altair bar chart (use mark_bar) with these seven artist names along the x-axis, and with the heights of the bars given by the probabilities from the predict_proba method (still for this song with loudness -4 and valence 0.6).
What three artists does our classifier think are most likely for this (imaginary) song?
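A minimal sketch of pairing predict_proba with classes_ follows, again on made-up data with three artists. Putting the probabilities in a pandas Series indexed by class name is just one convenient choice; new_song, proba and top2 are hypothetical names.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data with three made-up artists.
toy = pd.DataFrame({
    "Loudness": [-12, -11, -13, -8, -7, -9, -4, -3, -5],
    "Valence": [0.2, 0.3, 0.1, 0.5, 0.6, 0.4, 0.9, 0.8, 0.7],
    "Artist": ["A"] * 3 + ["B"] * 3 + ["C"] * 3,
})
features = ["Loudness", "Valence"]
clf = LogisticRegression(max_iter=1000).fit(toy[features], toy["Artist"])

# A single imaginary song, given as a one-row DataFrame with matching column names.
new_song = pd.DataFrame({"Loudness": [-4], "Valence": [0.6]})

# predict_proba returns one row per input; its columns are ordered like clf.classes_.
proba = pd.Series(clf.predict_proba(new_song)[0], index=clf.classes_)

# The most likely classes, from highest to lowest probability.
top2 = proba.nlargest(2)
print(proba)
print(top2)
```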
Illustrating the decision boundary
Using default_rng from NumPy, make a 5000x2 NumPy array arr of random real numbers between 0 and 1. Use one of your student ID numbers as a seed when you instantiate the random number generator so you get consistent results.
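The sketch below shows how a seed makes the random numbers reproducible. The seed value is a hypothetical placeholder, not a real student ID.

```python
import numpy as np

seed = 12345678  # hypothetical placeholder; use your own number here

# Instantiate the generator with a seed, then draw uniform values in [0, 1).
rng = np.random.default_rng(seed)
arr = rng.random(size=(5000, 2))

# A second generator built from the same seed produces the same array.
rng_again = np.random.default_rng(seed)
arr_again = rng_again.random(size=(5000, 2))

print(arr.shape, np.array_equal(arr, arr_again))
```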
Convert arr to a pandas DataFrame df_art (“art” is for “artificial”), and name the columns “Loudness” and “Valence”.
Rescale the “Loudness” column in df_art so that its values range from approximately -15 to -3. The point of this is to approximately match the values in the original “Loudness” column. (I am just picturing multiplying this column by a value, and then adding or subtracting a constant.)
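The multiply-then-shift idea can be sketched as follows: a value u in [0, 1) is sent to low + (high - low) * u, which lies in [low, high). The names df_demo, low and high, the seed 0, and the smaller array size are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_art, with fewer rows.
rng = np.random.default_rng(0)
df_demo = pd.DataFrame(rng.random(size=(1000, 2)), columns=["Loudness", "Valence"])

# Stretch [0, 1) to width (high - low), then shift so it starts at low.
low, high = -15, -3
df_demo["Loudness"] = low + (high - low) * df_demo["Loudness"]

print(df_demo["Loudness"].min(), df_demo["Loudness"].max())
```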
Add a new column called “pred” (for “prediction”) to df_art, corresponding to the predicted artist values, according to our logistic regression classifier.
Make an Altair scatter plot from df_art with “Loudness” along the x-axis, with “Valence” along the y-axis, and with the points colored using the prediction column.
Recall our question above about whether a song is more or less likely to be a Justin Bieber song as the “Valence” increases. Does your answer seem believable in terms of the above chart?
Look for the area of the chart corresponding to our loudness -4 and valence 0.6 values from the previous section. Recall that we found the three most likely artists for those values. Does this chart seem to agree with those three most likely artists? (In other words, does the way this chart looks seem to agree with the three artists we found above using predict_proba?) Answer in a markdown cell (not as a Python comment).
In terms of our scatter plot, do you see why the LogisticRegression class lives in the linear_model module in scikit-learn? What seems “linear” about this plot? Use the phrase “decision boundary” in your answer.
Submission
Reminder: everyone needs to make a submission on Canvas.
Reminder: include everyone’s full name at the top, after Names.
Using the Share button at the top right, enable public sharing, and enable Comment privileges. Then submit the created link on Canvas.