Week 8 Friday¶
Code from Sample Midterm 2, Question 1¶
import pandas as pd
import altair as alt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.metrics import log_loss
df = pd.read_csv("../data/spotify_dataset.csv", na_values=" ")
df.dropna(inplace=True)
df = df[(df["Artist"] == "Taylor Swift")|(df["Artist"] == "Billie Eilish")]
df = df[df["Artist"].isin(["Taylor Swift", "Billie Eilish"])]
alt.Chart(df).mark_circle().encode(
    x="Tempo",
    y="Energy",
    color="Artist"
)
1a¶
Rewrite the Taylor Swift/Billie Eilish line so that it uses isin.
We can replace
df = df[(df["Artist"] == "Taylor Swift")|(df["Artist"] == "Billie Eilish")]
by
df = df[df["Artist"].isin(["Taylor Swift", "Billie Eilish"])]
import pandas as pd
import altair as alt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.metrics import log_loss
df = pd.read_csv("../data/spotify_dataset.csv", na_values=" ")
df.dropna(inplace=True)
df = df[df["Artist"].isin(["Taylor Swift", "Billie Eilish"])]
alt.Chart(df).mark_circle().encode(
    x="Tempo",
    y="Energy",
    color="Artist"
)
When do we need to use index?¶
For some similar problems using isin, we have also had to use index. Why didn’t we in this case? Here we typed the list of artist names out by hand; index is needed when the values we want to match live in the index of another pandas object.
Here is a typical example where we have used index:
Find the sub-DataFrame containing only the 5 most frequent artists.
As a preliminary step, we look at the top 5 rows of value_counts().
df = pd.read_csv("../data/spotify_dataset.csv", na_values=" ")
df.dropna(inplace=True)
df["Artist"].value_counts()[:5]
Taylor Swift 52
Justin Bieber 32
Lil Uzi Vert 32
Juice WRLD 30
Pop Smoke 29
Name: Artist, dtype: int64
What if we want to replace the ["Taylor Swift", "Billie Eilish"]
portion above with this list of 5 artists?
Directly converting this Series into a list does not work, because it yields the values, whereas we want the keys.
list(df["Artist"].value_counts()[:5])
[52, 32, 32, 30, 29]
We can access the keys (or maybe I should call it the index) as follows.
df["Artist"].value_counts()[:5].index
Index(['Taylor Swift', 'Justin Bieber', 'Lil Uzi Vert', 'Juice WRLD',
'Pop Smoke'],
dtype='object')
You could also go in the other order, first extracting the index, and then taking the first 5 entries.
df["Artist"].value_counts().index[:5]
Index(['Taylor Swift', 'Justin Bieber', 'Lil Uzi Vert', 'Juice WRLD',
'Pop Smoke'],
dtype='object')
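Aside: isin accepts an Index directly (as we do below), but if you ever want a genuine Python list, tolist converts it. (This variant is my addition, not part of the midterm solution.)
# Convert the Index of the top 5 artists into a plain Python list.
df["Artist"].value_counts().index[:5].tolist()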
Here we use isin together with index.
df = pd.read_csv("../data/spotify_dataset.csv", na_values=" ")
df.dropna(inplace=True)
df = df[df["Artist"].isin(df["Artist"].value_counts()[:5].index)]
df.head(5)
 | Index | Highest Charting Position | Number of Times Charted | Week of Highest Charting | Song Name | Streams | Artist | Artist Followers | Song ID | Genre | ... | Danceability | Energy | Loudness | Speechiness | Acousticness | Liveness | Tempo | Duration (ms) | Valence | Chord
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
13 | 14 | 1 | 19 | 2021-04-02--2021-04-09 | Peaches (feat. Daniel Caesar & Giveon) | 20,294,457 | Justin Bieber | 48504126.0 | 4iJyoBOLtHqaGxP12qzhQI | ['canadian pop', 'pop', 'post-teen pop'] | ... | 0.677 | 0.696 | -6.181 | 0.1190 | 0.3210 | 0.420 | 90.030 | 198082.0 | 0.464 | C |
118 | 119 | 5 | 21 | 2021-03-19--2021-03-26 | Hold On | 6,300,416 | Justin Bieber | 1250353.0 | 49xx65gvlD7xXjDTavFqaJ | [] | ... | 0.658 | 0.634 | -5.797 | 0.0413 | 0.0106 | 0.132 | 139.980 | 170813.0 | 0.290 | C#/Db |
157 | 158 | 6 | 47 | 2020-11-06--2020-11-13 | What You Know Bout Love | 5,570,735 | Pop Smoke | 6837946.0 | 1tkg4EHVoqnhR6iFEXb60y | ['brooklyn drill'] | ... | 0.709 | 0.548 | -8.493 | 0.3530 | 0.6500 | 0.133 | 83.995 | 160000.0 | 0.543 | A#/Bb |
161 | 162 | 31 | 83 | 2019-12-27--2020-01-03 | Lucid Dreams | 5,477,563 | Juice WRLD | 19085118.0 | 285pBltuF7vW8TeWk8hdRR | ['chicago rap', 'melodic rap'] | ... | 0.511 | 0.566 | -7.230 | 0.2000 | 0.3490 | 0.340 | 83.903 | 239836.0 | 0.218 | F#/Gb |
166 | 167 | 7 | 56 | 2020-10-02--2020-10-09 | For The Night (feat. Lil Baby & DaBaby) | 5,431,375 | Pop Smoke | 6837946.0 | 0PvFJmanyNQMseIFrU708S | ['brooklyn drill'] | ... | 0.823 | 0.586 | -6.606 | 0.2000 | 0.1140 | 0.193 | 125.971 | 190476.0 | 0.347 | F#/Gb |
5 rows × 23 columns
Only the 5 most frequent artists remain. (There is actually a tie for 5th, so your results may be different.)
df["Artist"].value_counts()
Taylor Swift 52
Justin Bieber 32
Lil Uzi Vert 32
Juice WRLD 30
Pop Smoke 29
Name: Artist, dtype: int64
Back to the smaller DataFrame¶
We changed df in the previous part, so we get back to the result of 1a.
import pandas as pd
import altair as alt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.metrics import log_loss
df = pd.read_csv("../data/spotify_dataset.csv", na_values=" ")
df.dropna(inplace=True)
df = df[df["Artist"].isin(["Taylor Swift", "Billie Eilish"])]
alt.Chart(df).mark_circle().encode(
    x="Tempo",
    y="Energy",
    color="Artist"
)
1b¶
Rescale the “Tempo” and “Energy” data using StandardScaler. Overwrite the current columns in df using the rescaled data.
scaler = StandardScaler()
scaler.fit(df[["Tempo","Energy"]]) # Just the Tempo and Energy columns
StandardScaler()
df[["Tempo","Energy"]] = scaler.transform(df[["Tempo","Energy"]])
Notice how the “Tempo” and “Energy” columns have been rescaled.
df.head(10)
 | Index | Highest Charting Position | Number of Times Charted | Week of Highest Charting | Song Name | Streams | Artist | Artist Followers | Song ID | Genre | ... | Danceability | Energy | Loudness | Speechiness | Acousticness | Liveness | Tempo | Duration (ms) | Valence | Chord
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
52 | 53 | 19 | 3 | 2021-07-09--2021-07-16 | NDA | 9,635,619 | Billie Eilish | 47014200.0 | 38GBNKZUhfBkk3oNlWzRYd | ['electropop', 'pop'] | ... | 0.765 | -0.585706 | -9.921 | 0.0713 | 0.341 | 0.1120 | -1.246922 | 195777.0 | 0.554 | G#/Ab |
110 | 111 | 46 | 83 | 2020-04-17--2020-04-24 | lovely (with Khalid) | 6,569,547 | Billie Eilish | 47014200.0 | 0u2P5u6lvoDfwTYjAADbn4 | ['electropop', 'pop'] | ... | 0.351 | -0.994373 | -10.109 | 0.0333 | 0.934 | 0.0950 | -0.223263 | 200186.0 | 0.120 | E |
164 | 165 | 13 | 83 | 2020-01-24--2020-01-31 | bad guy | 5,436,286 | Billie Eilish | 1250353.0 | 1hewNsVmijBqjKvFRQfk4m | [] | ... | 0.701 | -0.309723 | -10.965 | 0.3750 | 0.328 | 0.1000 | 0.447858 | 194088.0 | 0.562 | G |
215 | 216 | 15 | 7 | 2021-06-04--2021-06-11 | Lost Cause | 5,203,319 | Billie Eilish | 1250353.0 | 36fhyo5QlxBW7DrDHIOn1G | [] | ... | 0.671 | -0.782078 | -8.494 | 0.2410 | 0.705 | 0.0577 | -1.583936 | 212496.0 | 0.518 | A#/Bb |
245 | 246 | 4 | 10 | 2021-04-30--2021-05-07 | Your Power | 5,135,380 | Billie Eilish | 47014200.0 | 042Sl6Mn83JHyLEqdK7uI0 | ['electropop', 'pop'] | ... | 0.632 | -1.058061 | -14.025 | 0.0801 | 0.932 | 0.2330 | 0.262323 | 245897.0 | 0.208 | A |
308 | 309 | 2 | 31 | 2020-11-13--2020-11-20 | Therefore I Am | 5,265,630 | Billie Eilish | 1250353.0 | 0aACcc1jEv2C6VSmmJbllJ | [] | ... | 0.889 | -0.760849 | -7.773 | 0.0697 | 0.218 | 0.0550 | -0.942780 | 174321.0 | 0.716 | B |
377 | 378 | 10 | 69 | 2019-12-27--2020-01-03 | everything i wanted | 5,448,680 | Billie Eilish | 1250353.0 | 1OC8hMoqAuWmTWJ4heJNlO | [] | ... | 0.704 | -1.371196 | -14.454 | 0.0994 | 0.901 | 0.1060 | -0.063565 | 245426.0 | 0.243 | F#/Gb |
398 | 399 | 20 | 5 | 2021-04-09--2021-04-16 | Mr. Perfectly Fine (Taylor’s Version) (From Th... | 5,480,722 | Taylor Swift | 42227614.0 | 2CYVETnhM9aytqrazYYwrK | ['pop', 'post-teen pop'] | ... | 0.660 | 1.770765 | -6.269 | 0.0521 | 0.162 | 0.0667 | 0.475388 | 277592.0 | 0.714 | B |
399 | 400 | 42 | 60 | 2020-01-31--2020-02-07 | when the party's over | 5,467,596 | Billie Eilish | 1250353.0 | 5KWwUzVxubgSPmLIUAQhYt | [] | ... | 0.484 | -2.018694 | -14.075 | 0.0627 | 0.979 | 0.0895 | 0.071883 | 199931.0 | 0.194 | C#/Db |
421 | 422 | 8 | 5 | 2021-02-12--2021-02-19 | Love Story (Taylor’s Version) | 6,202,933 | Taylor Swift | 42227614.0 | 6YvqWjhGD8mB5QXcbcUKtx | ['pop', 'post-teen pop'] | ... | 0.627 | 1.638080 | -4.311 | 0.0310 | 0.130 | 0.0845 | -0.095762 | 235767.0 | 0.415 | D |
10 rows × 23 columns
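As a quick sanity check (this step is an addition on my part, not part of the original solution), after StandardScaler each rescaled column should have mean approximately 0 and standard deviation approximately 1.
# Means should be very close to 0.
df[["Tempo", "Energy"]].mean()
# Standard deviations should be close to 1.  (pandas' std uses ddof=1
# while StandardScaler divides by n, so they won't be exactly 1.)
df[["Tempo", "Energy"]].std()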
Aside: It’s also possible to do the fit and the transform all in a single line, using fit_transform. Here we apply it to the “Valence” and “Danceability” columns.
scaler.fit_transform(df[["Valence","Danceability"]])
array([[ 8.22295774e-01, 1.49241781e+00],
[-1.42718156e+00, -1.66200332e+00],
[ 8.63760794e-01, 1.00477783e+00],
[ 6.35703184e-01, 7.76196587e-01],
[-9.71066342e-01, 4.79040973e-01],
[ 1.66196243e+00, 2.43722028e+00],
[-7.89656880e-01, 1.02763595e+00],
[ 1.65159617e+00, 6.92383465e-01],
[-1.04363013e+00, -6.48626485e-01],
[ 1.01841051e-01, 4.40944100e-01],
[ 6.92717586e-01, -1.34960896e+00],
[ 4.07645574e-01, 4.79040973e-01],
[ 1.79587964e-01, -9.79633892e-04],
[-5.51233014e-01, -7.71733811e-02],
[ 3.97279319e-01, -1.27341521e+00],
[ 2.31419239e-01, 1.51407860e-01],
[ 2.08179576e+00, 1.66766343e+00],
[-1.52132197e-01, -1.11340834e+00],
[ 1.43908795e+00, 2.19982233e-01],
[-3.85372934e-01, -4.81000241e-01],
[ 5.83871909e-01, 1.43788486e-01],
[ 7.54915116e-01, 6.75947386e-02],
[-8.00023135e-01, -4.81000241e-01],
[ 5.00097759e-02, -4.66958822e-02],
[-1.37535029e+00, 3.03795355e-01],
[-1.88329678e+00, -1.72295832e+00],
[-3.90556062e-01, -8.16252729e-01],
[ 7.23816351e-01, -1.76225252e-01],
[-2.45428492e-01, 4.47366144e-02],
[-2.19512854e-01, -4.35283993e-01],
[ 2.15869856e-01, 2.19982233e-01],
[-9.03685685e-01, -1.92868144e+00],
[-9.55516960e-01, -7.71733811e-02],
[ 7.23816351e-01, -4.12425869e-01],
[-2.14329727e-01, 2.73317856e-01],
[-5.35683632e-01, -2.44799625e-01],
[-3.90556062e-01, -1.36484771e+00],
[ 9.20775196e-01, -1.76225252e-01],
[ 5.99421291e-01, 2.65698481e-01],
[-5.61599269e-01, -2.60038374e-01],
[ 2.71932044e+00, 9.13345332e-01],
[ 8.06746391e-01, 3.34272854e-01],
[-1.62984185e+00, -8.61968977e-01],
[-1.26132148e+00, -2.06583018e+00],
[-1.15765893e+00, 1.58385031e+00],
[ 3.96435209e-02, -2.82896498e-01],
[ 1.61013115e+00, 8.90487208e-01],
[-1.47382971e+00, -8.61968977e-01],
[ 2.83250514e-01, 6.39047842e-01],
[-8.47515391e-02, -1.22889629e-01],
[ 1.01841051e-01, 2.04743483e-01],
[ 8.62916685e-02, -4.12425869e-01],
[ 5.52773144e-01, 1.81885359e-01],
[ 3.35081789e-01, -7.71733811e-02],
[ 5.73505654e-01, 1.74265985e-01],
[-1.46346346e+00, -1.63914520e+00],
[-2.92076639e-01, 5.93331594e-01],
[ 1.74404836e-01, 7.07622215e-01],
[-1.21985646e+00, -1.60866770e+00],
[-1.78118917e+00, -1.44104146e+00],
[ 1.65159617e+00, 1.53813406e+00],
[-9.86615725e-01, 2.42198153e+00],
[ 9.00042686e-01, 1.19526220e+00],
[-1.42718156e+00, -1.13626647e+00],
[-5.87514907e-01, 2.16292279e+00],
[ 1.07108589e+00, -6.95540063e-02],
[-5.77148652e-01, 8.67629084e-01],
[ 1.72415996e+00, 3.11414729e-01],
[-7.68924370e-01, -6.56245860e-01],
[ 1.38122943e-01, -9.22923975e-01]])
1c¶
We will eventually use K-Nearest Neighbors on these two columns. Why is rescaling them natural?
K-Nearest Neighbors is based on distances between data points. If you look at the Altair chart and imagine it drawn with the same scale for both the x and y-axes, the chart would be extremely spread out in the x direction, because the Tempo values range from about 60 to 210, whereas the Energy values only range from about 0 to 1. Any distance we compute on the unscaled data would therefore be dominated almost entirely by Tempo, so we should rescale.
(Another possible answer is that the units are not the same.)
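To make this concrete, here is a small illustration (the two songs and their numbers are hypothetical, chosen just for this example).
import numpy as np

# Two made-up songs as (Tempo, Energy) points, before any rescaling.
a = np.array([90.0, 0.9])
b = np.array([140.0, 0.1])

# The Euclidean distance is about 50.006: essentially just the Tempo gap.
# The Energy difference of 0.8, despite spanning most of its range,
# barely affects the result.
np.linalg.norm(a - b)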
1d¶
Our goal is to predict the Artist using the two scaled columns. Divide this data into a training set and a test set using train_test_split.
We’ll save 20% of the data as our test set.
X_train, X_test, y_train, y_test = train_test_split(df[["Tempo","Energy"]], df["Artist"], test_size=0.2)
X_train
 | Tempo | Energy
---|---|---
1400 | 0.525407 | -1.870088 |
921 | 0.608672 | -0.670624 |
701 | -0.871961 | 0.412079 |
1195 | -1.635140 | -1.403040 |
970 | -0.069010 | -0.920070 |
1374 | -1.246583 | 0.995889 |
667 | 0.812707 | -0.261956 |
983 | -0.103710 | -1.604720 |
942 | -1.080494 | 0.741136 |
695 | -1.117053 | 0.327161 |
702 | 1.014544 | 1.171032 |
978 | 0.952315 | -1.121750 |
215 | -1.583936 | -0.782078 |
1428 | 0.966688 | 0.268780 |
691 | -1.587284 | -0.649394 |
308 | -0.942780 | -0.760849 |
377 | -0.063565 | -1.371196 |
696 | -0.064918 | 0.024641 |
688 | 0.409777 | 1.043656 |
164 | 0.447858 | -0.309723 |
965 | -0.398348 | -0.373411 |
429 | -0.736174 | 1.367405 |
950 | 0.269695 | -1.169516 |
435 | 2.143382 | 0.178555 |
608 | -1.449874 | 0.688062 |
694 | 0.006543 | -0.792693 |
445 | -0.537009 | -0.585706 |
671 | -0.332636 | -0.113350 |
399 | 0.071883 | -2.018694 |
699 | 1.217835 | 1.309024 |
967 | 0.473494 | -0.039047 |
437 | 0.884371 | 0.757058 |
432 | 1.323927 | 1.319639 |
444 | -0.674758 | -0.054969 |
439 | -1.071566 | 0.661525 |
1424 | -0.130461 | -0.702468 |
960 | -1.300458 | -0.166424 |
433 | 0.704991 | 0.067100 |
1379 | -0.062585 | -0.506095 |
889 | -1.565302 | -0.548554 |
424 | -1.378954 | 0.481075 |
436 | 0.207805 | 1.791994 |
966 | -0.911767 | 0.024641 |
700 | 0.205674 | 0.178555 |
52 | -1.246922 | -0.585706 |
585 | 1.671832 | -1.291585 |
1466 | 2.038541 | 1.839760 |
441 | 0.405414 | -0.208883 |
555 | 0.448738 | -1.275663 |
698 | 0.111317 | -1.132364 |
428 | 0.275546 | 1.537241 |
991 | 0.776080 | -0.591013 |
976 | 0.677427 | 1.144495 |
431 | -0.893504 | 0.863205 |
245 | 0.262323 | -1.058061 |
421 | -0.095762 | 1.638080 |
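Note that train_test_split chooses the split randomly, so rerunning the notebook produces a different training set and test set (and different numbers below). If you want a reproducible split (an addition on my part, not required by the question), pass a random_state argument; the value 0 here is arbitrary.
# Fixing random_state makes the train/test split deterministic.
X_train, X_test, y_train, y_test = train_test_split(
    df[["Tempo", "Energy"]], df["Artist"], test_size=0.2, random_state=0
)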
1e¶
Fit either KNeighborsClassifier or KNeighborsRegressor to this data using the training set.
Only KNeighborsClassifier makes sense, because the “Artist” values are categorical, not quantitative/numerical. (For something like MNIST, the values are arguably numerical, but one should still use KNeighborsClassifier in that case, because the values are discrete and their order is not significant.)
Here we use 6 neighbors.
clf = KNeighborsClassifier(n_neighbors=6)
clf.fit(X_train, y_train)
KNeighborsClassifier(n_neighbors=6)
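As a quick extra check (an addition, not part of the original solution), every scikit-learn classifier has a score method reporting mean accuracy; here we evaluate it on the test set.
# Fraction of test songs whose artist is predicted correctly.
clf.score(X_test, y_test)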
1f¶
Evaluate the performance of the model on the test set using log_loss.
clf.predict(X_test)
array(['Taylor Swift', 'Taylor Swift', 'Taylor Swift', 'Taylor Swift',
'Billie Eilish', 'Taylor Swift', 'Taylor Swift', 'Taylor Swift',
'Billie Eilish', 'Taylor Swift', 'Taylor Swift', 'Taylor Swift',
'Taylor Swift', 'Billie Eilish'], dtype=object)
clf.classes_
array(['Billie Eilish', 'Taylor Swift'], dtype=object)
clf.predict_proba(X_test)
array([[0.16666667, 0.83333333],
[0.33333333, 0.66666667],
[0. , 1. ],
[0.33333333, 0.66666667],
[0.5 , 0.5 ],
[0. , 1. ],
[0. , 1. ],
[0. , 1. ],
[0.66666667, 0.33333333],
[0. , 1. ],
[0. , 1. ],
[0. , 1. ],
[0.33333333, 0.66666667],
[0.5 , 0.5 ]])
log_loss(y_test, clf.predict_proba(X_test), labels=['Billie Eilish', 'Taylor Swift'])
0.37642270657331045
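For intuition (a sketch I’m adding, not part of the original solution), log_loss is the average of -log(p) over the test samples, where p is the probability the model assigned to the true class; scikit-learn clips the probabilities away from 0 so that log(0) never occurs.
import numpy as np

probs = clf.predict_proba(X_test)
# Column of the true class for each row: 0 for "Billie Eilish",
# 1 for "Taylor Swift", matching the order in clf.classes_.
true_col = (y_test == "Taylor Swift").astype(int).to_numpy()
p_true = probs[np.arange(len(probs)), true_col]
p_true = np.clip(p_true, 1e-15, 1 - 1e-15)  # avoid log(0), as sklearn does
-np.mean(np.log(p_true))  # should match the log_loss value above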
This log_loss value by itself doesn’t mean much, but if we make a change to the model and evaluate it again, then the comparison is meaningful. For example, let’s try using 10 neighbors instead of 6 neighbors.
clf = KNeighborsClassifier(n_neighbors=10)
clf.fit(X_train, y_train)
log_loss(y_test, clf.predict_proba(X_test), labels=['Billie Eilish', 'Taylor Swift'])
0.44423507473761276
Because the loss score increased, this is evidence that, for this particular data, K-Nearest Neighbors performs better with 6 neighbors than with 10 neighbors.
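Instead of comparing two values by hand, we could loop over several choices of n_neighbors (a sketch; the particular values of k below are arbitrary).
# Compare the test-set log_loss for several values of n_neighbors.
for k in [4, 6, 8, 10, 12]:
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(X_train, y_train)
    probs = clf.predict_proba(X_test)
    print(k, log_loss(y_test, probs, labels=clf.classes_))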
1g¶
Change the Altair chart so that it uses the predicted Artist class, not the actual Artist.
Here are a few examples with different numbers of neighbors. Notice how the model appears to become less flexible (i.e., appears to have more bias) as the number of neighbors increases.
alt.Chart(df).mark_circle().encode(
    x="Tempo",
    y="Energy",
    color="Artist"
).properties(
    title="Original"
)
clf = KNeighborsClassifier(n_neighbors=1)
clf.fit(X_train, y_train)
df["pred"] = clf.predict(df[["Tempo","Energy"]])
alt.Chart(df).mark_circle().encode(
    x="Tempo",
    y="Energy",
    color="pred"
).properties(
    title="n_neighbors = 1"
)
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)
df["pred"] = clf.predict(df[["Tempo","Energy"]])
alt.Chart(df).mark_circle().encode(
    x="Tempo",
    y="Energy",
    color="pred"
).properties(
    title="n_neighbors = 5"
)
clf = KNeighborsClassifier(n_neighbors=12)
clf.fit(X_train, y_train)
df["pred"] = clf.predict(df[["Tempo","Energy"]])
alt.Chart(df).mark_circle().encode(
    x="Tempo",
    y="Energy",
    color="pred"
).properties(
    title="n_neighbors = 12"
)
clf = KNeighborsClassifier(n_neighbors=20)
clf.fit(X_train, y_train)
df["pred"] = clf.predict(df[["Tempo","Energy"]])
alt.Chart(df).mark_circle().encode(
    x="Tempo",
    y="Energy",
    color="pred"
).properties(
    title="n_neighbors = 20"
)
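Aside: the four cells above repeat the same pattern, so the charts could also be built in a loop (a sketch I’m adding, not part of the original solution). The df.copy() matters: Altair keeps a reference to the DataFrame until a chart is actually rendered, so writing every prediction into one shared column would make all four charts display the final model’s predictions.
# Build one prediction chart per value of n_neighbors and stack them.
charts = []
for k in [1, 5, 12, 20]:
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(X_train, y_train)
    plot_df = df.copy()  # each chart keeps its own "pred" column
    plot_df["pred"] = clf.predict(plot_df[["Tempo", "Energy"]])
    charts.append(
        alt.Chart(plot_df).mark_circle().encode(
            x="Tempo",
            y="Energy",
            color="pred"
        ).properties(
            title=f"n_neighbors = {k}"
        )
    )
alt.vconcat(*charts)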