K-Nearest Neighbors Regressor

YuJa recording

Before the recording, we introduced the K-Nearest Neighbors Classifier and the K-Nearest Neighbors Regressor. We mentioned that larger values of K correspond to smaller variance (so overfitting is more likely to occur with smaller values of K). We also discussed the training error curve and test error curve, similar to the figures in Chapter 2 of Introduction to Statistical Learning.

import numpy as np
import pandas as pd
import seaborn as sns
import altair as alt
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error
df = sns.load_dataset("penguins")
# df = df.dropna()  # non-inplace alternative
df.dropna(inplace=True)  # drop the rows containing missing values
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 333 entries, 0 to 343
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            333 non-null    object 
 1   island             333 non-null    object 
 2   bill_length_mm     333 non-null    float64
 3   bill_depth_mm      333 non-null    float64
 4   flipper_length_mm  333 non-null    float64
 5   body_mass_g        333 non-null    float64
 6   sex                333 non-null    object 
dtypes: float64(4), object(3)
memory usage: 20.8+ KB

It would be better to rescale the data first (i.e., to normalize it). We’ll talk about that soon, but we’re skipping it for now.
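As a preview (a rough sketch we won’t use in the rest of this section), rescaling the numeric columns with scikit-learn’s StandardScaler might look something like the following. In practice you would fit the scaler on the training data only, to avoid leaking information from the test set.

from sklearn.preprocessing import StandardScaler

# Sketch only: standardize the three measurement columns used below.
cols = ["bill_length_mm", "bill_depth_mm", "flipper_length_mm"]
scaler = StandardScaler()
df_scaled = df.copy()
df_scaled[cols] = scaler.fit_transform(df[cols])  # each column now has mean 0 and standard deviation 1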

X_train, X_test, y_train, y_test = train_test_split(
    df[["bill_length_mm","bill_depth_mm","flipper_length_mm"]], df["body_mass_g"], test_size = 0.5)
X_train.shape
(166, 3)

The syntax for performing K-Nearest Neighbors regression using scikit-learn is essentially the same as the syntax for performing linear regression.
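For comparison, here is a rough sketch (not part of the analysis below) of what the same pattern looks like with LinearRegression; the instantiate-fit-predict steps are identical.

from sklearn.linear_model import LinearRegression

# Same pattern as below: instantiate, fit on the training data, predict on the test data.
lin = LinearRegression()
lin.fit(X_train, y_train)
lin.predict(X_test)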

reg = KNeighborsRegressor(n_neighbors=10)
reg.fit(X_train, y_train)
KNeighborsRegressor(n_neighbors=10)
reg.predict(X_test)
array([4980. , 3792.5, 3600. , 5062.5, 3592.5, 4027.5, 4980. , 4012.5,
       3792.5, 5522.5, 3607.5, 3380. , 5162.5, 3892.5, 5477.5, 5585. ,
       5022.5, 5562.5, 3557.5, 3912.5, 3265. , 3462.5, 5067.5, 5422.5,
       5092.5, 5585. , 3467.5, 4052.5, 3952.5, 3697.5, 3755. , 3647.5,
       5492.5, 3395. , 5457.5, 4075. , 3590. , 3422.5, 3655. , 4520. ,
       4935. , 4030. , 3222.5, 4490. , 3575. , 3455. , 3562.5, 5472.5,
       4130. , 3825. , 5107.5, 4692.5, 3367.5, 5052.5, 4027.5, 3500. ,
       3842.5, 4760. , 5585. , 4115. , 5512.5, 3522.5, 3667.5, 4052.5,
       5625. , 4755. , 4062.5, 4030. , 4570. , 5207.5, 3225. , 3460. ,
       3267.5, 3630. , 3305. , 3842.5, 5585. , 3515. , 4490. , 3447.5,
       3600. , 5562.5, 3612.5, 3787.5, 4017.5, 3525. , 3727.5, 3630. ,
       4705. , 4130. , 4112.5, 3480. , 4012.5, 3485. , 4992.5, 4057.5,
       4857.5, 3232.5, 3675. , 3870. , 4042.5, 3635. , 3670. , 4957.5,
       5427.5, 3912.5, 5235. , 4677.5, 4595. , 3780. , 5000. , 4130. ,
       3725. , 3357.5, 3442.5, 5000. , 3705. , 4012.5, 3607.5, 3440. ,
       5397.5, 3805. , 3467.5, 4080. , 3655. , 3367.5, 3625. , 3730. ,
       4550. , 3652.5, 3812.5, 5017.5, 4115. , 4705. , 3582.5, 3710. ,
       5557.5, 5305. , 5585. , 3522.5, 3607.5, 4010. , 4755. , 5585. ,
       3892.5, 3575. , 3982.5, 5555. , 4475. , 3467.5, 3750. , 3562.5,
       3862.5, 3892.5, 5147.5, 3707.5, 3675. , 4475. , 3672.5, 3720. ,
       5625. , 5537.5, 3390. , 3655. , 3585. , 5297.5, 3805. ])
X_test.shape
(167, 3)
mean_absolute_error(reg.predict(X_test), y_test)
282.6646706586826
mean_absolute_error(reg.predict(X_train), y_train)
254.23192771084337

The above numbers are similar, with reg performing just slightly better on the training data. That suggests that for this training set, we are not overfitting the data when using K=10.
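As an aside, we also imported mean_squared_error above; if you prefer squared error, the analogous computation (a sketch, not used below) would be:

mean_squared_error(reg.predict(X_test), y_test)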

def get_scores(k):
    # Fit a K-Nearest Neighbors regressor with n_neighbors=k and
    # return the pair (training MAE, test MAE).
    reg = KNeighborsRegressor(n_neighbors=k)
    reg.fit(X_train, y_train)
    train_error = mean_absolute_error(reg.predict(X_train), y_train)
    test_error = mean_absolute_error(reg.predict(X_test), y_test)
    return (train_error, test_error)
get_scores(10)
(254.23192771084337, 282.6646706586826)
get_scores(1)
(0.0, 338.92215568862275)

With K=1, each training point is its own nearest neighbor, so the training error is exactly 0 even though the test error is large. That is a classic sign of overfitting.
df_scores = pd.DataFrame({"k":range(1,150),"train_error":np.nan,"test_error":np.nan})
df_scores
k train_error test_error
0 1 NaN NaN
1 2 NaN NaN
2 3 NaN NaN
3 4 NaN NaN
4 5 NaN NaN
... ... ... ...
144 145 NaN NaN
145 146 NaN NaN
146 147 NaN NaN
147 148 NaN NaN
148 149 NaN NaN

149 rows × 3 columns

df_scores.loc[0,["train_error","test_error"]] = get_scores(1)
df_scores.head()
k train_error test_error
0 1 0.0 338.922156
1 2 NaN NaN
2 3 NaN NaN
3 4 NaN NaN
4 5 NaN NaN

We often avoid using for loops in Math 10, but I couldn’t find a better way to fill in this data. Let me know if you see a more Pythonic approach! (One possibility is sketched after the table below.)

for i in df_scores.index:
    df_scores.loc[i,["train_error","test_error"]] = get_scores(df_scores.loc[i,"k"])
df_scores
k train_error test_error
0 1 0.000000 338.922156
1 2 189.533133 290.943114
2 3 209.236948 284.780439
3 4 225.828313 280.688623
4 5 240.993976 281.976048
... ... ... ...
144 145 599.882634 547.684287
145 146 605.592920 552.209622
146 147 609.482829 556.251782
147 148 613.268276 559.163093
148 149 616.836136 562.464333

149 rows × 3 columns

Usually when we plot a test error curve, we want higher flexibility (which corresponds to higher variance) on the right. Since higher values of K correspond to lower flexibility, we are going to add a column to the DataFrame containing the reciprocals 1/K of the K values.

df_scores["kinv"] = 1/df_scores.k
ctrain = alt.Chart(df_scores).mark_line().encode(
    x = "kinv",
    y = "train_error"
)
ctest = alt.Chart(df_scores).mark_line(color="orange").encode(
    x = "kinv",
    y = "test_error"
)

The blue curve is the training error and the orange curve is the test error. Notice how underfitting occurs for very high values of K (the left side of the chart, where 1/K is small) and how overfitting occurs for small values of K (the right side).

ctrain+ctest
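As an aside, here is a sketch of an alternative way to make the same plot with a legend, by converting df_scores to long format with melt and encoding the error type as color.

# Reshape to long format: one row per (kinv, error_type) pair.
df_long = df_scores.melt(
    id_vars="kinv",
    value_vars=["train_error", "test_error"],
    var_name="error_type",
    value_name="error",
)
alt.Chart(df_long).mark_line().encode(
    x="kinv",
    y="error",
    color="error_type",  # Altair adds a legend distinguishing the two curves
)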