Week 8 Friday

YuJa recording

Code from Sample Midterm 2, Question 1

import pandas as pd
import altair as alt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.metrics import log_loss

df = pd.read_csv("../data/spotify_dataset.csv", na_values=" ")
df.dropna(inplace=True)
df = df[(df["Artist"] == "Taylor Swift")|(df["Artist"] == "Billie Eilish")]
df = df[df["Artist"].isin(["Taylor Swift", "Billie Eilish"])]

alt.Chart(df).mark_circle().encode(
    x="Tempo",
    y="Energy",
    color="Artist"
)

1a

Rewrite the Taylor Swift/Billie Eilish line so that it uses isin.

We can replace

df = df[(df["Artist"] == "Taylor Swift")|(df["Artist"] == "Billie Eilish")]

by

df = df[df["Artist"].isin(["Taylor Swift", "Billie Eilish"])]
import pandas as pd
import altair as alt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.metrics import log_loss

df = pd.read_csv("../data/spotify_dataset.csv", na_values=" ")
df.dropna(inplace=True)
df = df[df["Artist"].isin(["Taylor Swift", "Billie Eilish"])]

alt.Chart(df).mark_circle().encode(
    x="Tempo",
    y="Energy",
    color="Artist"
)
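
As a quick sanity check, the two conditions produce the same boolean Series row by row. Here is a sketch of one way to verify this (the name df_check is just for this check):

# A sketch: verify that the two boolean conditions agree on every row.
df_check = pd.read_csv("../data/spotify_dataset.csv", na_values=" ").dropna()
cond_or = (df_check["Artist"] == "Taylor Swift") | (df_check["Artist"] == "Billie Eilish")
cond_isin = df_check["Artist"].isin(["Taylor Swift", "Billie Eilish"])
cond_or.equals(cond_isin)  # True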

When do we need to use index?

For some similar problems using isin, we have also had to use index. Why didn’t we need it in this case? The reason is that in 1a we typed the list of artist names by hand; index is needed when the values we want to match are stored in the index of a pandas Series.

Here is a typical example where we have used index:

  • Find the sub-DataFrame containing only the 5 most frequent artists.

As a preliminary step, we look at the top 5 rows of value_counts().

df = pd.read_csv("../data/spotify_dataset.csv", na_values=" ")
df.dropna(inplace=True)
df["Artist"].value_counts()[:5]
Taylor Swift     52
Justin Bieber    32
Lil Uzi Vert     32
Juice WRLD       30
Pop Smoke        29
Name: Artist, dtype: int64

What if we want to replace the ["Taylor Swift", "Billie Eilish"] portion above with this list of 5 artists?

Directly converting this Series into a list does not work, because it yields the values, whereas we want the keys.

list(df["Artist"].value_counts()[:5])
[52, 32, 32, 30, 29]

We can access the keys (more precisely, the index) as follows.

df["Artist"].value_counts()[:5].index
Index(['Taylor Swift', 'Justin Bieber', 'Lil Uzi Vert', 'Juice WRLD',
       'Pop Smoke'],
      dtype='object')

You could also go in the other order, first extracting the index, and then taking the first 5 entries.

df["Artist"].value_counts().index[:5]
Index(['Taylor Swift', 'Justin Bieber', 'Lil Uzi Vert', 'Juice WRLD',
       'Pop Smoke'],
      dtype='object')

Here we use isin together with index.

df = pd.read_csv("../data/spotify_dataset.csv", na_values=" ")
df.dropna(inplace=True)
df = df[df["Artist"].isin(df["Artist"].value_counts()[:5].index)]
df.head(5)
     Highest Charting Position                                 Song Name         Artist  ...  Energy    Tempo  Chord
13                           1    Peaches (feat. Daniel Caesar & Giveon)  Justin Bieber  ...   0.696   90.030      C
118                          5                                   Hold On  Justin Bieber  ...   0.634  139.980  C#/Db
157                          6                   What You Know Bout Love      Pop Smoke  ...   0.548   83.995  A#/Bb
161                         31                              Lucid Dreams     Juice WRLD  ...   0.566   83.903  F#/Gb
166                          7  For The Night (feat. Lil Baby & DaBaby)      Pop Smoke  ...   0.586  125.971  F#/Gb

5 rows × 23 columns

Only the 5 most frequent artists remain. (There is actually a tie for 5th, so your results may be different.)

df["Artist"].value_counts()
Taylor Swift     52
Justin Bieber    32
Lil Uzi Vert     32
Juice WRLD       30
Pop Smoke        29
Name: Artist, dtype: int64
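
To see the tie for 5th place, we could reload the full dataset and look at a couple of extra rows of value_counts. This is just a sketch of one way to check (the name df_full is ours):

# Reload the full data, since our current df only contains the top 5 artists.
df_full = pd.read_csv("../data/spotify_dataset.csv", na_values=" ")
df_full.dropna(inplace=True)
df_full["Artist"].value_counts()[:7]  # a couple of rows past the cutoff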

Back to the smaller DataFrame

We changed df in the previous part, so we rerun the code to get back to the result of 1a.

import pandas as pd
import altair as alt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.metrics import log_loss

df = pd.read_csv("../data/spotify_dataset.csv", na_values=" ")
df.dropna(inplace=True)
df = df[df["Artist"].isin(["Taylor Swift", "Billie Eilish"])]

alt.Chart(df).mark_circle().encode(
    x="Tempo",
    y="Energy",
    color="Artist"
)

1b

Rescale the “Tempo” and “Energy” data using StandardScaler. Overwrite the current columns in df using the rescaled data.

scaler = StandardScaler()
scaler.fit(df[["Tempo","Energy"]]) # Just the Tempo and Energy columns
StandardScaler()
df[["Tempo","Energy"]] = scaler.transform(df[["Tempo","Energy"]])

Notice how the “Tempo” and “Energy” columns have been rescaled.

df.head(10)
                                             Song Name         Artist  ...    Energy     Tempo  Chord
52                                                 NDA  Billie Eilish  ... -0.585706 -1.246922  G#/Ab
110                               lovely (with Khalid)  Billie Eilish  ... -0.994373 -0.223263      E
164                                            bad guy  Billie Eilish  ... -0.309723  0.447858      G
215                                         Lost Cause  Billie Eilish  ... -0.782078 -1.583936  A#/Bb
245                                         Your Power  Billie Eilish  ... -1.058061  0.262323      A
308                                     Therefore I Am  Billie Eilish  ... -0.760849 -0.942780      B
377                                everything i wanted  Billie Eilish  ... -1.371196 -0.063565  F#/Gb
398  Mr. Perfectly Fine (Taylor’s Version) (From Th...   Taylor Swift  ...  1.770765  0.475388      B
399                              when the party's over  Billie Eilish  ... -2.018694  0.071883  C#/Db
421                      Love Story (Taylor’s Version)   Taylor Swift  ...  1.638080 -0.095762      D

10 rows × 23 columns

Aside: It’s also possible to do the fit and the transform all in a single line. Here we apply both fit and transform to the “Valence” and “Danceability” columns.

scaler.fit_transform(df[["Valence","Danceability"]])
array([[ 8.22295774e-01,  1.49241781e+00],
       [-1.42718156e+00, -1.66200332e+00],
       [ 8.63760794e-01,  1.00477783e+00],
       ...,
       [-7.68924370e-01, -6.56245860e-01],
       [ 1.38122943e-01, -9.22923975e-01]])
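
The single-step version should match the two-step version. Here is a sketch of one way to confirm that, using numpy’s allclose (the names arr1 and arr2 are ours):

import numpy as np

# A sketch: fit_transform should give the same result as fit followed by transform.
arr1 = scaler.fit_transform(df[["Valence","Danceability"]])
scaler.fit(df[["Valence","Danceability"]])
arr2 = scaler.transform(df[["Valence","Danceability"]])
np.allclose(arr1, arr2)  # True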

1c

We will eventually use K-Nearest Neighbors on these two columns. Why is rescaling them natural?

If you look at the Altair chart and imagine it drawn with the same scale on both axes, it would be stretched enormously in the x direction: the “Tempo” values range from about 60 to 210, whereas the “Energy” values only range from about 0 to 1. K-Nearest Neighbors is based on distances between points, so without rescaling, the Tempo differences would dominate the distance computation and Energy would barely matter.

(Another possible answer is that the units are not the same.)
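
To make this concrete, here is a small sketch with two made-up songs (the values are invented for illustration):

import numpy as np

# Two hypothetical songs, stored as (Tempo, Energy), before rescaling.
song_a = np.array([90.0, 0.7])
song_b = np.array([170.0, 0.2])

# The Euclidean distance is about 80, driven almost entirely by Tempo;
# the Energy difference of 0.5 is negligible at this scale.
np.linalg.norm(song_a - song_b)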

1d

Our goal is to predict the Artist using the two scaled columns. Divide this data into a training set and a test set using train_test_split.

We’ll save 20% of the data as our test set.

X_train, X_test, y_train, y_test = train_test_split(df[["Tempo","Energy"]], df["Artist"], test_size=0.2)
X_train
Tempo Energy
1400 0.525407 -1.870088
921 0.608672 -0.670624
701 -0.871961 0.412079
1195 -1.635140 -1.403040
970 -0.069010 -0.920070
1374 -1.246583 0.995889
667 0.812707 -0.261956
983 -0.103710 -1.604720
942 -1.080494 0.741136
695 -1.117053 0.327161
702 1.014544 1.171032
978 0.952315 -1.121750
215 -1.583936 -0.782078
1428 0.966688 0.268780
691 -1.587284 -0.649394
308 -0.942780 -0.760849
377 -0.063565 -1.371196
696 -0.064918 0.024641
688 0.409777 1.043656
164 0.447858 -0.309723
965 -0.398348 -0.373411
429 -0.736174 1.367405
950 0.269695 -1.169516
435 2.143382 0.178555
608 -1.449874 0.688062
694 0.006543 -0.792693
445 -0.537009 -0.585706
671 -0.332636 -0.113350
399 0.071883 -2.018694
699 1.217835 1.309024
967 0.473494 -0.039047
437 0.884371 0.757058
432 1.323927 1.319639
444 -0.674758 -0.054969
439 -1.071566 0.661525
1424 -0.130461 -0.702468
960 -1.300458 -0.166424
433 0.704991 0.067100
1379 -0.062585 -0.506095
889 -1.565302 -0.548554
424 -1.378954 0.481075
436 0.207805 1.791994
966 -0.911767 0.024641
700 0.205674 0.178555
52 -1.246922 -0.585706
585 1.671832 -1.291585
1466 2.038541 1.839760
441 0.405414 -0.208883
555 0.448738 -1.275663
698 0.111317 -1.132364
428 0.275546 1.537241
991 0.776080 -0.591013
976 0.677427 1.144495
431 -0.893504 0.863205
245 0.262323 -1.058061
421 -0.095762 1.638080
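
Note that train_test_split chooses the split randomly, so re-running this cell produces a different training set each time. If you want a reproducible split, you can pass the random_state keyword argument, as in this sketch:

# Passing random_state makes the same "random" split happen on every run.
X_train, X_test, y_train, y_test = train_test_split(
    df[["Tempo","Energy"]], df["Artist"], test_size=0.2, random_state=0
)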

1e

Fit either KNeighborsClassifier or KNeighborsRegressor to this data using the training set.

Only KNeighborsClassifier makes sense, because the “Artist” values are categorical, not quantitative. (For something like MNIST, the digit labels are arguably numerical, but one should still use KNeighborsClassifier in that case, because the values are discrete and their order is not significant.)
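
As a quick illustration of why the regressor is the wrong tool here, here is a sketch of what happens if we try it anyway; scikit-learn cannot convert the artist names to floats, so fit raises a ValueError:

# A sketch: fitting the regressor fails because y_train contains strings.
reg = KNeighborsRegressor(n_neighbors=6)
try:
    reg.fit(X_train, y_train)
except ValueError as e:
    print(e)  # could not convert string to float: ...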

Here we use 6 neighbors.

clf = KNeighborsClassifier(n_neighbors=6)
clf.fit(X_train, y_train)
KNeighborsClassifier(n_neighbors=6)

1f

Evaluate the performance of the model on the test set using log_loss.

clf.predict(X_test)
array(['Taylor Swift', 'Taylor Swift', 'Taylor Swift', 'Taylor Swift',
       'Billie Eilish', 'Taylor Swift', 'Taylor Swift', 'Taylor Swift',
       'Billie Eilish', 'Taylor Swift', 'Taylor Swift', 'Taylor Swift',
       'Taylor Swift', 'Billie Eilish'], dtype=object)
clf.classes_
array(['Billie Eilish', 'Taylor Swift'], dtype=object)
clf.predict_proba(X_test)
array([[0.16666667, 0.83333333],
       [0.33333333, 0.66666667],
       [0.        , 1.        ],
       [0.33333333, 0.66666667],
       [0.5       , 0.5       ],
       [0.        , 1.        ],
       [0.        , 1.        ],
       [0.        , 1.        ],
       [0.66666667, 0.33333333],
       [0.        , 1.        ],
       [0.        , 1.        ],
       [0.        , 1.        ],
       [0.33333333, 0.66666667],
       [0.5       , 0.5       ]])
log_loss(y_test, clf.predict_proba(X_test), labels=['Billie Eilish', 'Taylor Swift'])
0.37642270657331045

This log_loss value by itself doesn’t mean much, but it becomes meaningful when we compare it to the value we get after making a change to the model. For example, let’s try using 10 neighbors instead of 6.

clf = KNeighborsClassifier(n_neighbors=10)
clf.fit(X_train, y_train)
log_loss(y_test, clf.predict_proba(X_test), labels=['Billie Eilish', 'Taylor Swift'])
0.44423507473761276

Because the log loss increased when we went from 6 to 10 neighbors, this is evidence that, for this particular data, K-Nearest Neighbors performs better with 6 neighbors than with 10.
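
More systematically, we could compare several values of n_neighbors in a loop. Here is a sketch (the particular values of k are arbitrary):

# Compare the test log loss for several choices of n_neighbors.
for k in [2, 4, 6, 8, 10, 12]:
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(X_train, y_train)
    loss = log_loss(y_test, clf.predict_proba(X_test), labels=clf.classes_)
    print(f"n_neighbors = {k}, log loss = {loss:.4f}")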

1g

Change the Altair chart so that it uses the predicted Artist class, not the actual Artist.

Here are a few examples with different numbers of neighbors. Notice how the model appears to become less flexible (i.e., appears to have more bias) as the number of neighbors increases.

alt.Chart(df).mark_circle().encode(
    x="Tempo",
    y="Energy",
    color="Artist"
).properties(
    title="Original"
)

clf = KNeighborsClassifier(n_neighbors=1)
clf.fit(X_train, y_train)
df["pred"] = clf.predict(df[["Tempo","Energy"]])

alt.Chart(df).mark_circle().encode(
    x="Tempo",
    y="Energy",
    color="pred"
).properties(
    title="n_neighbors = 1"
)

clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)
df["pred"] = clf.predict(df[["Tempo","Energy"]])

alt.Chart(df).mark_circle().encode(
    x="Tempo",
    y="Energy",
    color="pred"
).properties(
    title="n_neighbors = 5"
)

clf = KNeighborsClassifier(n_neighbors=12)
clf.fit(X_train, y_train)
df["pred"] = clf.predict(df[["Tempo","Energy"]])

alt.Chart(df).mark_circle().encode(
    x="Tempo",
    y="Energy",
    color="pred"
).properties(
    title="n_neighbors = 12"
)

clf = KNeighborsClassifier(n_neighbors=20)
clf.fit(X_train, y_train)
df["pred"] = clf.predict(df[["Tempo","Energy"]])

alt.Chart(df).mark_circle().encode(
    x="Tempo",
    y="Energy",
    color="pred"
).properties(
    title="n_neighbors = 20"
)
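
Since these four cells are nearly identical, here is a sketch of how we could generate all of the charts with a helper function (the name make_pred_chart is ours) and stack them using alt.vconcat:

def make_pred_chart(k):
    # Work on a copy so each chart keeps its own predictions,
    # rather than all charts sharing the final "pred" column.
    data = df[["Tempo","Energy"]].copy()
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(X_train, y_train)
    data["pred"] = clf.predict(data[["Tempo","Energy"]])
    return alt.Chart(data).mark_circle().encode(
        x="Tempo",
        y="Energy",
        color="pred"
    ).properties(
        title=f"n_neighbors = {k}"
    )

# Stack the four charts vertically.
alt.vconcat(*[make_pred_chart(k) for k in [1, 5, 12, 20]])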