Week 8 Friday

YuJa recording

Code from Sample Midterm 2, Question 1

import pandas as pd
import altair as alt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.metrics import log_loss

df = pd.read_csv("../data/spotify_dataset.csv", na_values=" ")
df.dropna(inplace=True)
df = df[(df["Artist"] == "Taylor Swift")|(df["Artist"] == "Billie Eilish")]
df = df[df["Artist"].isin(["Taylor Swift", "Billie Eilish"])]

alt.Chart(df).mark_circle().encode(
    x="Tempo",
    y="Energy",
    color="Artist"
)

1a

Rewrite the Taylor Swift/Billie Eilish line so that it uses isin.

We can replace

df = df[(df["Artist"] == "Taylor Swift")|(df["Artist"] == "Billie Eilish")]

by

df = df[df["Artist"].isin(["Taylor Swift", "Billie Eilish"])]
import pandas as pd
import altair as alt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.metrics import log_loss

df = pd.read_csv("../data/spotify_dataset.csv", na_values=" ")
df.dropna(inplace=True)
df = df[df["Artist"].isin(["Taylor Swift", "Billie Eilish"])]

alt.Chart(df).mark_circle().encode(
    x="Tempo",
    y="Energy",
    color="Artist"
)
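
As a quick sanity check, the two conditions produce the same boolean Series row by row. Here is a sketch of one way to verify this (the name df_check is just for this check):

# A sketch: verify that the two boolean conditions agree on every row.
df_check = pd.read_csv("../data/spotify_dataset.csv", na_values=" ").dropna()
cond_or = (df_check["Artist"] == "Taylor Swift") | (df_check["Artist"] == "Billie Eilish")
cond_isin = df_check["Artist"].isin(["Taylor Swift", "Billie Eilish"])
cond_or.equals(cond_isin)  # True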

When do we need to use index?

For some similar problems using isin, we have also had to use index. Why didn’t we need it in this case? The reason is that in 1a we typed the list of artist names by hand; index is needed when the values we want to match are stored in the index of a pandas Series.

Here is a typical example where we have used index:

  • Find the sub-DataFrame containing only the 5 most frequent artists.

As a preliminary step, we look at the top 5 rows of value_counts().

df = pd.read_csv("../data/spotify_dataset.csv", na_values=" ")
df.dropna(inplace=True)
df["Artist"].value_counts()[:5]
Taylor Swift     52
Justin Bieber    32
Lil Uzi Vert     32
Juice WRLD       30
Pop Smoke        29
Name: Artist, dtype: int64

What if we want to replace the ["Taylor Swift", "Billie Eilish"] portion above with this list of 5 artists?

Directly converting this Series into a list does not work, because it yields the values, whereas we want the keys.

list(df["Artist"].value_counts()[:5])
[52, 32, 32, 30, 29]

We can access the keys (more precisely, the index) as follows.

df["Artist"].value_counts()[:5].index
Index(['Taylor Swift', 'Justin Bieber', 'Lil Uzi Vert', 'Juice WRLD',
       'Pop Smoke'],
      dtype='object')

You could also go in the other order, first extracting the index, and then taking the first 5 entries.

df["Artist"].value_counts().index[:5]
Index(['Taylor Swift', 'Justin Bieber', 'Lil Uzi Vert', 'Juice WRLD',
       'Pop Smoke'],
      dtype='object')

Here we use isin together with index.

df = pd.read_csv("../data/spotify_dataset.csv", na_values=" ")
df.dropna(inplace=True)
df = df[df["Artist"].isin(df["Artist"].value_counts()[:5].index)]
df.head(5)
     Highest Charting Position                                 Song Name         Artist  ...  Energy    Tempo  Chord
13                           1    Peaches (feat. Daniel Caesar & Giveon)  Justin Bieber  ...   0.696   90.030      C
118                          5                                   Hold On  Justin Bieber  ...   0.634  139.980  C#/Db
157                          6                   What You Know Bout Love      Pop Smoke  ...   0.548   83.995  A#/Bb
161                         31                              Lucid Dreams     Juice WRLD  ...   0.566   83.903  F#/Gb
166                          7  For The Night (feat. Lil Baby & DaBaby)      Pop Smoke  ...   0.586  125.971  F#/Gb

5 rows × 23 columns

Only the 5 most frequent artists remain. (There is actually a tie for 5th, so your results may be different.)

df["Artist"].value_counts()
Taylor Swift     52
Justin Bieber    32
Lil Uzi Vert     32
Juice WRLD       30
Pop Smoke        29
Name: Artist, dtype: int64
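
To see the tie for 5th place, we could reload the full dataset and look at a couple of extra rows of value_counts. This is just a sketch of one way to check (the name df_full is ours):

# Reload the full data, since our current df only contains the top 5 artists.
df_full = pd.read_csv("../data/spotify_dataset.csv", na_values=" ")
df_full.dropna(inplace=True)
df_full["Artist"].value_counts()[:7]  # a couple of rows past the cutoff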

Back to the smaller DataFrame

We changed df in the previous part, so we rerun the code to get back to the result of 1a.

import pandas as pd
import altair as alt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.metrics import log_loss

df = pd.read_csv("../data/spotify_dataset.csv", na_values=" ")
df.dropna(inplace=True)
df = df[df["Artist"].isin(["Taylor Swift", "Billie Eilish"])]

alt.Chart(df).mark_circle().encode(
    x="Tempo",
    y="Energy",
    color="Artist"
)

1b

Rescale the “Tempo” and “Energy” data using StandardScaler. Overwrite the current columns in df using the rescaled data.

scaler = StandardScaler()
scaler.fit(df[["Tempo","Energy"]]) # Just the Tempo and Energy columns
StandardScaler()
df[["Tempo","Energy"]] = scaler.transform(df[["Tempo","Energy"]])

Notice how the “Tempo” and “Energy” columns have been rescaled.

df.head(10)
                                             Song Name         Artist  ...    Energy     Tempo  Chord
52                                                 NDA  Billie Eilish  ... -0.585706 -1.246922  G#/Ab
110                               lovely (with Khalid)  Billie Eilish  ... -0.994373 -0.223263      E
164                                            bad guy  Billie Eilish  ... -0.309723  0.447858      G
215                                         Lost Cause  Billie Eilish  ... -0.782078 -1.583936  A#/Bb
245                                         Your Power  Billie Eilish  ... -1.058061  0.262323      A
308                                     Therefore I Am  Billie Eilish  ... -0.760849 -0.942780      B
377                                everything i wanted  Billie Eilish  ... -1.371196 -0.063565  F#/Gb
398  Mr. Perfectly Fine (Taylor’s Version) (From Th...   Taylor Swift  ...  1.770765  0.475388      B
399                              when the party's over  Billie Eilish  ... -2.018694  0.071883  C#/Db
421                      Love Story (Taylor’s Version)   Taylor Swift  ...  1.638080 -0.095762      D

10 rows × 23 columns

Aside: It’s also possible to do the fit and the transform all in a single line. Here we apply both fit and transform to the “Valence” and “Danceability” columns.

scaler.fit_transform(df[["Valence","Danceability"]])
array([[ 8.22295774e-01,  1.49241781e+00],
       [-1.42718156e+00, -1.66200332e+00],
       [ 8.63760794e-01,  1.00477783e+00],
       ...,
       [-7.68924370e-01, -6.56245860e-01],
       [ 1.38122943e-01, -9.22923975e-01]])
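
The single-step version should match the two-step version. Here is a sketch of one way to confirm that, using numpy’s allclose (the names arr1 and arr2 are ours):

import numpy as np

# A sketch: fit_transform should give the same result as fit followed by transform.
arr1 = scaler.fit_transform(df[["Valence","Danceability"]])
scaler.fit(df[["Valence","Danceability"]])
arr2 = scaler.transform(df[["Valence","Danceability"]])
np.allclose(arr1, arr2)  # True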

1c

We will eventually use K-Nearest Neighbors on these two columns. Why is rescaling them natural?

If you look at the Altair chart and imagine it drawn with the same scale on both axes, it would be stretched enormously in the x direction: the “Tempo” values range from about 60 to 210, whereas the “Energy” values only range from about 0 to 1. K-Nearest Neighbors is based on distances between points, so without rescaling, the Tempo differences would dominate the distance computation and Energy would barely matter.

(Another possible answer is that the units are not the same.)
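
To make this concrete, here is a small sketch with two made-up songs (the values are invented for illustration):

import numpy as np

# Two hypothetical songs, stored as (Tempo, Energy), before rescaling.
song_a = np.array([90.0, 0.7])
song_b = np.array([170.0, 0.2])

# The Euclidean distance is about 80, driven almost entirely by Tempo;
# the Energy difference of 0.5 is negligible at this scale.
np.linalg.norm(song_a - song_b)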

1d

Our goal is to predict the Artist using the two scaled columns. Divide this data into a training set and a test set using train_test_split.

We’ll save 20% of the data as our test set.

X_train, X_test, y_train, y_test = train_test_split(df[["Tempo","Energy"]], df["Artist"], test_size=0.2)
X_train
Tempo Energy
1400 0.525407 -1.870088
921 0.608672 -0.670624
701 -0.871961 0.412079
1195 -1.635140 -1.403040
970 -0.069010 -0.920070
1374 -1.246583 0.995889
667 0.812707 -0.261956
983 -0.103710 -1.604720
942 -1.080494 0.741136
695 -1.117053 0.327161
702 1.014544 1.171032
978 0.952315 -1.121750
215 -1.583936 -0.782078
1428 0.966688 0.268780
691 -1.587284 -0.649394
308 -0.942780 -0.760849
377 -0.063565 -1.371196
696 -0.064918 0.024641
688 0.409777 1.043656
164 0.447858 -0.309723
965 -0.398348 -0.373411
429 -0.736174 1.367405
950 0.269695 -1.169516
435 2.143382 0.178555
608 -1.449874 0.688062
694 0.006543 -0.792693
445 -0.537009 -0.585706
671 -0.332636 -0.113350
399 0.071883 -2.018694
699 1.217835 1.309024
967 0.473494 -0.039047
437 0.884371 0.757058
432 1.323927 1.319639
444 -0.674758 -0.054969
439 -1.071566 0.661525
1424 -0.130461 -0.702468
960 -1.300458 -0.166424
433 0.704991 0.067100
1379 -0.062585 -0.506095
889 -1.565302 -0.548554
424 -1.378954 0.481075
436 0.207805 1.791994
966 -0.911767 0.024641
700 0.205674 0.178555
52 -1.246922 -0.585706
585 1.671832 -1.291585
1466 2.038541 1.839760
441 0.405414 -0.208883
555 0.448738 -1.275663
698 0.111317 -1.132364
428 0.275546 1.537241
991 0.776080 -0.591013
976 0.677427 1.144495
431 -0.893504 0.863205
245 0.262323 -1.058061
421 -0.095762 1.638080
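
Note that train_test_split chooses the split randomly, so re-running this cell produces a different training set each time. If you want a reproducible split, you can pass the random_state keyword argument, as in this sketch:

# Passing random_state makes the same "random" split happen on every run.
X_train, X_test, y_train, y_test = train_test_split(
    df[["Tempo","Energy"]], df["Artist"], test_size=0.2, random_state=0
)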

1e

Fit either KNeighborsClassifier or KNeighborsRegressor to this data using the training set.

Only KNeighborsClassifier makes sense, because the “Artist” values are categorical, not quantitative. (For something like MNIST, the digit labels are arguably numerical, but one should still use KNeighborsClassifier in that case, because the values are discrete and their order is not significant.)
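
As a quick illustration of why the regressor is the wrong tool here, here is a sketch of what happens if we try it anyway; scikit-learn cannot convert the artist names to floats, so fit raises a ValueError:

# A sketch: fitting the regressor fails because y_train contains strings.
reg = KNeighborsRegressor(n_neighbors=6)
try:
    reg.fit(X_train, y_train)
except ValueError as e:
    print(e)  # could not convert string to float: ...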

Here we use 6 neighbors.

clf = KNeighborsClassifier(n_neighbors=6)
clf.fit(X_train, y_train)
KNeighborsClassifier(n_neighbors=6)

1f

Evaluate the performance of the model on the test set using log_loss.

clf.predict(X_test)
array(['Taylor Swift', 'Taylor Swift', 'Taylor Swift', 'Taylor Swift',
       'Billie Eilish', 'Taylor Swift', 'Taylor Swift', 'Taylor Swift',
       'Billie Eilish', 'Taylor Swift', 'Taylor Swift', 'Taylor Swift',
       'Taylor Swift', 'Billie Eilish'], dtype=object)
clf.classes_
array(['Billie Eilish', 'Taylor Swift'], dtype=object)
clf.predict_proba(X_test)
array([[0.16666667, 0.83333333],
       [0.33333333, 0.66666667],
       [0.        , 1.        ],
       [0.33333333, 0.66666667],
       [0.5       , 0.5       ],
       [0.        , 1.        ],
       [0.        , 1.        ],
       [0.        , 1.        ],
       [0.66666667, 0.33333333],
       [0.        , 1.        ],
       [0.        , 1.        ],
       [0.        , 1.        ],
       [0.33333333, 0.66666667],
       [0.5       , 0.5       ]])
log_loss(y_test, clf.predict_proba(X_test), labels=['Billie Eilish', 'Taylor Swift'])
0.37642270657331045

This log_loss value by itself doesn’t mean much, but it becomes meaningful when we compare it to the value we get after making a change to the model. For example, let’s try using 10 neighbors instead of 6.

clf = KNeighborsClassifier(n_neighbors=10)
clf.fit(X_train, y_train)
log_loss(y_test, clf.predict_proba(X_test), labels=['Billie Eilish', 'Taylor Swift'])
0.44423507473761276

Because the log loss increased when we went from 6 to 10 neighbors, this is evidence that, for this particular data, K-Nearest Neighbors performs better with 6 neighbors than with 10.
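
More systematically, we could compare several values of n_neighbors in a loop. Here is a sketch (the particular values of k are arbitrary):

# Compare the test log loss for several choices of n_neighbors.
for k in [2, 4, 6, 8, 10, 12]:
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(X_train, y_train)
    loss = log_loss(y_test, clf.predict_proba(X_test), labels=clf.classes_)
    print(f"n_neighbors = {k}, log loss = {loss:.4f}")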

1g

Change the Altair chart so that it uses the predicted Artist class, not the actual Artist.

Here are a few examples with different numbers of neighbors. Notice how the model appears to become less flexible (i.e., appears to have more bias) as the number of neighbors increases.

alt.Chart(df).mark_circle().encode(
    x="Tempo",
    y="Energy",
    color="Artist"
).properties(
    title="Original"
)

clf = KNeighborsClassifier(n_neighbors=1)
clf.fit(X_train, y_train)
df["pred"] = clf.predict(df[["Tempo","Energy"]])

alt.Chart(df).mark_circle().encode(
    x="Tempo",
    y="Energy",
    color="pred"
).properties(
    title="n_neighbors = 1"
)

clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)
df["pred"] = clf.predict(df[["Tempo","Energy"]])

alt.Chart(df).mark_circle().encode(
    x="Tempo",
    y="Energy",
    color="pred"
).properties(
    title="n_neighbors = 5"
)

clf = KNeighborsClassifier(n_neighbors=12)
clf.fit(X_train, y_train)
df["pred"] = clf.predict(df[["Tempo","Energy"]])

alt.Chart(df).mark_circle().encode(
    x="Tempo",
    y="Energy",
    color="pred"
).properties(
    title="n_neighbors = 12"
)

clf = KNeighborsClassifier(n_neighbors=20)
clf.fit(X_train, y_train)
df["pred"] = clf.predict(df[["Tempo","Energy"]])

alt.Chart(df).mark_circle().encode(
    x="Tempo",
    y="Energy",
    color="pred"
).properties(
    title="n_neighbors = 20"
)
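
Since these four cells are nearly identical, here is a sketch of how we could generate all of the charts with a helper function (the name make_pred_chart is ours) and stack them using alt.vconcat:

def make_pred_chart(k):
    # Work on a copy so each chart keeps its own predictions,
    # rather than all charts sharing the final "pred" column.
    data = df[["Tempo","Energy"]].copy()
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(X_train, y_train)
    data["pred"] = clf.predict(data[["Tempo","Energy"]])
    return alt.Chart(data).mark_circle().encode(
        x="Tempo",
        y="Energy",
        color="pred"
    ).properties(
        title=f"n_neighbors = {k}"
    )

# Stack the four charts vertically.
alt.vconcat(*[make_pred_chart(k) for k in [1, 5, 12, 20]])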