K-Nearest Neighbors Regressor

YuJa recording

Before the recording, we introduced the K-Nearest Neighbors Classifier and the K-Nearest Neighbors Regressor. We mentioned that larger values of K correspond to smaller variance (so overfitting is more likely to occur with smaller values of K). We also discussed the training error curve and test error curve, similar to the figures in Chapter 2 of Introduction to Statistical Learning.

import numpy as np
import pandas as pd
import seaborn as sns
import altair as alt
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error
df = sns.load_dataset("penguins")
# df = df.dropna()  # non-inplace alternative
df.dropna(inplace=True)  # drop the rows containing missing values
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 333 entries, 0 to 343
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            333 non-null    object 
 1   island             333 non-null    object 
 2   bill_length_mm     333 non-null    float64
 3   bill_depth_mm      333 non-null    float64
 4   flipper_length_mm  333 non-null    float64
 5   body_mass_g        333 non-null    float64
 6   sex                333 non-null    object 
dtypes: float64(4), object(3)
memory usage: 20.8+ KB

It would be better to rescale the data first (i.e., to normalize it). We’ll talk about that soon, but we’re skipping it for now.
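As a preview (a rough sketch we won’t use in the rest of this section), rescaling the numeric columns with scikit-learn’s StandardScaler might look something like the following. In practice you would fit the scaler on the training data only, to avoid leaking information from the test set.

from sklearn.preprocessing import StandardScaler

# Sketch only: standardize the three measurement columns used below.
cols = ["bill_length_mm", "bill_depth_mm", "flipper_length_mm"]
scaler = StandardScaler()
df_scaled = df.copy()
df_scaled[cols] = scaler.fit_transform(df[cols])  # each column now has mean 0 and standard deviation 1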

X_train, X_test, y_train, y_test = train_test_split(
    df[["bill_length_mm","bill_depth_mm","flipper_length_mm"]], df["body_mass_g"], test_size = 0.5)
X_train.shape
(166, 3)

The syntax for performing K-Nearest Neighbors regression using scikit-learn is essentially the same as the syntax for performing linear regression.
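For comparison, here is a rough sketch (not part of the analysis below) of what the same pattern looks like with LinearRegression; the instantiate-fit-predict steps are identical.

from sklearn.linear_model import LinearRegression

# Same pattern as below: instantiate, fit on the training data, predict on the test data.
lin = LinearRegression()
lin.fit(X_train, y_train)
lin.predict(X_test)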

reg = KNeighborsRegressor(n_neighbors=10)
reg.fit(X_train, y_train)
KNeighborsRegressor(n_neighbors=10)
reg.predict(X_test)
array([4980. , 3792.5, 3600. , 5062.5, 3592.5, 4027.5, 4980. , 4012.5,
       3792.5, 5522.5, 3607.5, 3380. , 5162.5, 3892.5, 5477.5, 5585. ,
       5022.5, 5562.5, 3557.5, 3912.5, 3265. , 3462.5, 5067.5, 5422.5,
       5092.5, 5585. , 3467.5, 4052.5, 3952.5, 3697.5, 3755. , 3647.5,
       5492.5, 3395. , 5457.5, 4075. , 3590. , 3422.5, 3655. , 4520. ,
       4935. , 4030. , 3222.5, 4490. , 3575. , 3455. , 3562.5, 5472.5,
       4130. , 3825. , 5107.5, 4692.5, 3367.5, 5052.5, 4027.5, 3500. ,
       3842.5, 4760. , 5585. , 4115. , 5512.5, 3522.5, 3667.5, 4052.5,
       5625. , 4755. , 4062.5, 4030. , 4570. , 5207.5, 3225. , 3460. ,
       3267.5, 3630. , 3305. , 3842.5, 5585. , 3515. , 4490. , 3447.5,
       3600. , 5562.5, 3612.5, 3787.5, 4017.5, 3525. , 3727.5, 3630. ,
       4705. , 4130. , 4112.5, 3480. , 4012.5, 3485. , 4992.5, 4057.5,
       4857.5, 3232.5, 3675. , 3870. , 4042.5, 3635. , 3670. , 4957.5,
       5427.5, 3912.5, 5235. , 4677.5, 4595. , 3780. , 5000. , 4130. ,
       3725. , 3357.5, 3442.5, 5000. , 3705. , 4012.5, 3607.5, 3440. ,
       5397.5, 3805. , 3467.5, 4080. , 3655. , 3367.5, 3625. , 3730. ,
       4550. , 3652.5, 3812.5, 5017.5, 4115. , 4705. , 3582.5, 3710. ,
       5557.5, 5305. , 5585. , 3522.5, 3607.5, 4010. , 4755. , 5585. ,
       3892.5, 3575. , 3982.5, 5555. , 4475. , 3467.5, 3750. , 3562.5,
       3862.5, 3892.5, 5147.5, 3707.5, 3675. , 4475. , 3672.5, 3720. ,
       5625. , 5537.5, 3390. , 3655. , 3585. , 5297.5, 3805. ])
X_test.shape
(167, 3)
mean_absolute_error(reg.predict(X_test), y_test)
282.6646706586826
mean_absolute_error(reg.predict(X_train), y_train)
254.23192771084337

The above numbers are similar, with reg performing just slightly better on the training data. That suggests that for this training set, we are not overfitting the data when using K=10.
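As an aside, we also imported mean_squared_error above; if you prefer squared error, the analogous computation (a sketch, not used below) would be:

mean_squared_error(reg.predict(X_test), y_test)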

def get_scores(k):
    # Fit a K-Nearest Neighbors regressor with n_neighbors=k and
    # return the pair (training MAE, test MAE).
    reg = KNeighborsRegressor(n_neighbors=k)
    reg.fit(X_train, y_train)
    train_error = mean_absolute_error(reg.predict(X_train), y_train)
    test_error = mean_absolute_error(reg.predict(X_test), y_test)
    return (train_error, test_error)
get_scores(10)
(254.23192771084337, 282.6646706586826)
get_scores(1)
(0.0, 338.92215568862275)

With K=1, each training point is its own nearest neighbor, so the training error is exactly 0 even though the test error is large. That is a classic sign of overfitting.
df_scores = pd.DataFrame({"k":range(1,150),"train_error":np.nan,"test_error":np.nan})
df_scores
k train_error test_error
0 1 NaN NaN
1 2 NaN NaN
2 3 NaN NaN
3 4 NaN NaN
4 5 NaN NaN
... ... ... ...
144 145 NaN NaN
145 146 NaN NaN
146 147 NaN NaN
147 148 NaN NaN
148 149 NaN NaN

149 rows × 3 columns

df_scores.loc[0,["train_error","test_error"]] = get_scores(1)
df_scores.head()
k train_error test_error
0 1 0.0 338.922156
1 2 NaN NaN
2 3 NaN NaN
3 4 NaN NaN
4 5 NaN NaN

We often avoid using for loops in Math 10, but I couldn’t find a better way to fill in this data. Let me know if you see a more Pythonic approach! (One possibility is sketched after the table below.)

for i in df_scores.index:
    df_scores.loc[i,["train_error","test_error"]] = get_scores(df_scores.loc[i,"k"])
df_scores
k train_error test_error
0 1 0.000000 338.922156
1 2 189.533133 290.943114
2 3 209.236948 284.780439
3 4 225.828313 280.688623
4 5 240.993976 281.976048
... ... ... ...
144 145 599.882634 547.684287
145 146 605.592920 552.209622
146 147 609.482829 556.251782
147 148 613.268276 559.163093
148 149 616.836136 562.464333

149 rows × 3 columns

Usually when we plot a test error curve, we want higher flexibility (which corresponds to higher variance) on the right. Since higher values of K correspond to lower flexibility, we are going to add a column to the DataFrame containing the reciprocals 1/K of the K values.

df_scores["kinv"] = 1/df_scores.k
ctrain = alt.Chart(df_scores).mark_line().encode(
    x = "kinv",
    y = "train_error"
)
ctest = alt.Chart(df_scores).mark_line(color="orange").encode(
    x = "kinv",
    y = "test_error"
)

The blue curve is the training error and the orange curve is the test error. Notice how underfitting occurs for very high values of K (the left side of the chart, where 1/K is small) and how overfitting occurs for small values of K (the right side).

ctrain+ctest
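As an aside, here is a sketch of an alternative way to make the same plot with a legend, by converting df_scores to long format with melt and encoding the error type as color.

# Reshape to long format: one row per (kinv, error_type) pair.
df_long = df_scores.melt(
    id_vars="kinv",
    value_vars=["train_error", "test_error"],
    var_name="error_type",
    value_name="error",
)
alt.Chart(df_long).mark_line().encode(
    x="kinv",
    y="error",
    color="error_type",  # Altair adds a legend distinguishing the two curves
)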