K-Nearest Neighbors Regressor

Before the recording, we introduced the K-Nearest Neighbors Classifier and the K-Nearest Neighbors Regressor. We mentioned that larger values of K correspond to lower variance (that is, to a less flexible model), so overfitting is more likely to occur with smaller values of K. We also discussed training error curves and test error curves, like the ones in the figures in Chapter 2 of Introduction to Statistical Learning.
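As a rough sketch of what the regressor computes, assuming Euclidean distance and equal weight for every neighbor, the prediction at a point is the mean target value of its K nearest training points. The function name knn_predict and the data below are made up for illustration; this is not how scikit-learn implements the search internally.

```python
import math

def knn_predict(X_train, y_train, x, k):
    """Predict the target at x as the mean target of its k nearest training points."""
    # Pair each training point's Euclidean distance to x with its target value.
    dists = sorted(
        (math.dist(row, x), target) for row, target in zip(X_train, y_train)
    )
    nearest = dists[:k]
    return sum(target for _, target in nearest) / k

# Made-up one-feature data, for illustration only.
X_train = [(1.0,), (2.0,), (3.0,), (4.0,), (5.0,)]
y_train = [10.0, 20.0, 15.0, 40.0, 35.0]

# K=1: each training point is its own nearest neighbor, so the fit
# reproduces the training targets exactly (zero training error).
fit_k1 = [knn_predict(X_train, y_train, x, 1) for x in X_train]

# K=5: every prediction averages all five points, so the model is constant.
fit_k5 = [knn_predict(X_train, y_train, x, 5) for x in X_train]
```

With K=1 the model follows the training data perfectly (most flexible), while with K equal to the number of training points it predicts the overall mean everywhere (least flexible).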

import numpy as np
import pandas as pd
import seaborn as sns
import altair as alt
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error
df = sns.load_dataset("penguins")
df = df.dropna()  # drop rows with missing values before fitting
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 333 entries, 0 to 343
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            333 non-null    object 
 1   island             333 non-null    object 
 2   bill_length_mm     333 non-null    float64
 3   bill_depth_mm      333 non-null    float64
 4   flipper_length_mm  333 non-null    float64
 5   body_mass_g        333 non-null    float64
 6   sex                333 non-null    object 
dtypes: float64(4), object(3)
memory usage: 20.8+ KB

It would be better to rescale the data first (i.e., to normalize the data). We’ll talk about that soon but we’re skipping it for now.
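To see why rescaling matters for a distance-based method, here is a small sketch using made-up measurements and one common choice of rescaling (subtract the column mean, divide by the column standard deviation). The helper names standardize and nearest_other are hypothetical. The column with the larger spread dominates the raw Euclidean distance, so the nearest neighbor can change once the columns are put on the same scale.

```python
import math
import statistics

# Made-up (bill_depth_mm, flipper_length_mm) pairs, for illustration only.
points = [(14.0, 180.0), (20.0, 184.0), (14.5, 195.0), (20.5, 230.0)]

def standardize(rows):
    """Rescale each column to mean 0 and (population) standard deviation 1."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return [
        tuple((value - m) / s for value, m, s in zip(row, means, stds))
        for row in rows
    ]

def nearest_other(rows, i):
    """Index of the row closest to rows[i], excluding row i itself."""
    others = [j for j in range(len(rows)) if j != i]
    return min(others, key=lambda j: math.dist(rows[i], rows[j]))

# Nearest neighbor of the first point, before and after rescaling.
raw_neighbor = nearest_other(points, 0)
scaled_neighbor = nearest_other(standardize(points), 0)
```

In this toy data the raw distances pick the point with a similar flipper length, while the rescaled distances pick the point with a similar bill depth.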

X_train, X_test, y_train, y_test = train_test_split(
    df[["bill_length_mm","bill_depth_mm","flipper_length_mm"]], df["body_mass_g"], test_size = 0.5)
X_train.shape
(166, 3)

The syntax for performing K-Nearest Neighbors regression using scikit-learn is essentially the same as the syntax for performing linear regression.

reg = KNeighborsRegressor(n_neighbors=10)
reg.fit(X_train, y_train)
reg.predict(X_test)
array([4980. , 3792.5, 3600. , 5062.5, 3592.5, 4027.5, 4980. , 4012.5,
       3792.5, 5522.5, 3607.5, 3380. , 5162.5, 3892.5, 5477.5, 5585. ,
       5022.5, 5562.5, 3557.5, 3912.5, 3265. , 3462.5, 5067.5, 5422.5,
       5092.5, 5585. , 3467.5, 4052.5, 3952.5, 3697.5, 3755. , 3647.5,
       5492.5, 3395. , 5457.5, 4075. , 3590. , 3422.5, 3655. , 4520. ,
       4935. , 4030. , 3222.5, 4490. , 3575. , 3455. , 3562.5, 5472.5,
       4130. , 3825. , 5107.5, 4692.5, 3367.5, 5052.5, 4027.5, 3500. ,
       3842.5, 4760. , 5585. , 4115. , 5512.5, 3522.5, 3667.5, 4052.5,
       5625. , 4755. , 4062.5, 4030. , 4570. , 5207.5, 3225. , 3460. ,
       3267.5, 3630. , 3305. , 3842.5, 5585. , 3515. , 4490. , 3447.5,
       3600. , 5562.5, 3612.5, 3787.5, 4017.5, 3525. , 3727.5, 3630. ,
       4705. , 4130. , 4112.5, 3480. , 4012.5, 3485. , 4992.5, 4057.5,
       4857.5, 3232.5, 3675. , 3870. , 4042.5, 3635. , 3670. , 4957.5,
       5427.5, 3912.5, 5235. , 4677.5, 4595. , 3780. , 5000. , 4130. ,
       3725. , 3357.5, 3442.5, 5000. , 3705. , 4012.5, 3607.5, 3440. ,
       5397.5, 3805. , 3467.5, 4080. , 3655. , 3367.5, 3625. , 3730. ,
       4550. , 3652.5, 3812.5, 5017.5, 4115. , 4705. , 3582.5, 3710. ,
       5557.5, 5305. , 5585. , 3522.5, 3607.5, 4010. , 4755. , 5585. ,
       3892.5, 3575. , 3982.5, 5555. , 4475. , 3467.5, 3750. , 3562.5,
       3862.5, 3892.5, 5147.5, 3707.5, 3675. , 4475. , 3672.5, 3720. ,
       5625. , 5537.5, 3390. , 3655. , 3585. , 5297.5, 3805. ])
X_test.shape
(167, 3)
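To illustrate the remark about shared syntax, here is a minimal sketch on synthetic data (not the penguins data) in which a linear regression and a K-Nearest Neighbors regressor are fit and used through exactly the same calls. The data and the dictionary name predictions are assumptions made for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data, for illustration only: y is roughly 3*x plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 1))
y = 3 * X[:, 0] + rng.normal(0, 1, size=50)

predictions = {}
for model in (LinearRegression(), KNeighborsRegressor(n_neighbors=10)):
    model.fit(X, y)                                        # same call for both estimators
    predictions[type(model).__name__] = model.predict(X)  # same call again
```

Only the line that instantiates the model differs; the instantiate, fit, predict pattern is the same.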
mean_absolute_error(y_test, reg.predict(X_test))    # test error
mean_absolute_error(y_train, reg.predict(X_train))  # training error

The above numbers are similar, with reg performing just slightly better on the training data. That suggests that for this training set, we are not overfitting the data when using K=10.
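For reference, the mean absolute error is the average of the absolute differences between the true and predicted values, so it is measured in the same units as body_mass_g (grams). A hand-written version, with a hypothetical function name mae and made-up numbers, might look like this:

```python
def mae(y_true, y_pred):
    """Mean absolute error: the average of |true - predicted|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up body masses in grams, for illustration only.
y_true = [3500.0, 4000.0, 5000.0]
y_pred = [3600.0, 3900.0, 5300.0]

error = mae(y_true, y_pred)    # (100 + 100 + 300) / 3
swapped = mae(y_pred, y_true)  # same value: the absolute difference is symmetric
```

Because the absolute difference is symmetric, the order of the two arguments does not change the result.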

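def get_scores(k):
    # Fit a K-Nearest Neighbors regressor with k neighbors and return its
    # mean absolute error on the training set and on the test set.
    reg = KNeighborsRegressor(n_neighbors=k)
    reg.fit(X_train, y_train)
    train_error = mean_absolute_error(y_train, reg.predict(X_train))
    test_error = mean_absolute_error(y_test, reg.predict(X_test))
    return (train_error, test_error)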
get_scores(10)
(254.23192771084337, 282.6646706586826)
get_scores(1)
(0.0, 338.92215568862275)
df_scores = pd.DataFrame({"k":range(1,150),"train_error":np.nan,"test_error":np.nan})
df_scores
       k  train_error  test_error
0      1          NaN         NaN
1      2          NaN         NaN
2      3          NaN         NaN
3      4          NaN         NaN
4      5          NaN         NaN
...  ...          ...         ...
144  145          NaN         NaN
145  146          NaN         NaN
146  147          NaN         NaN
147  148          NaN         NaN
148  149          NaN         NaN

149 rows × 3 columns

df_scores.loc[0,["train_error","test_error"]] = get_scores(1)
df_scores.head()
   k  train_error  test_error
0  1          0.0  338.922156
1  2          NaN         NaN
2  3          NaN         NaN
3  4          NaN         NaN
4  5          NaN         NaN

We often avoid using for loops in Math 10, but I couldn’t find a better way to fill in this data. Let me know if you see a more Pythonic approach!

for i in df_scores.index:
    df_scores.loc[i,["train_error","test_error"]] = get_scores(df_scores.loc[i,"k"])
df_scores
       k  train_error  test_error
0      1     0.000000  338.922156
1      2   189.533133  290.943114
2      3   209.236948  284.780439
3      4   225.828313  280.688623
4      5   240.993976  281.976048
...  ...          ...         ...
144  145   599.882634  547.684287
145  146   605.592920  552.209622
146  147   609.482829  556.251782
147  148   613.268276  559.163093
148  149   616.836136  562.464333

149 rows × 3 columns
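One possible alternative to filling the DataFrame row by row is to collect the scores with a list comprehension and build the DataFrame in one step. This is only a sketch: it still loops over the values of K, it just avoids the repeated .loc assignments. The get_scores defined here is a stand-in returning made-up numbers so that the example runs on its own; in the notebook, the real get_scores would be used.

```python
import pandas as pd

def get_scores(k):
    # Hypothetical stand-in so this sketch is self-contained; the values are
    # made up and are NOT real training or test errors.
    return (float(k), float(2 * k))

ks = range(1, 150)

# One (train_error, test_error) tuple per value of k becomes one row.
df_scores = pd.DataFrame(
    [get_scores(k) for k in ks],
    columns=["train_error", "test_error"],
)

# Put the k values in the first column, matching the layout used above.
df_scores.insert(0, "k", list(ks))
```

This also removes the need to create the columns filled with np.nan beforehand.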

Usually when we plot a test error curve, we want higher flexibility (= higher variance) on the right. Since higher values of K correspond to lower flexibility, we are going to add a column to the DataFrame containing the reciprocals of the K values.

df_scores["kinv"] = 1/df_scores.k
ctrain = alt.Chart(df_scores).mark_line().encode(
    x = "kinv",
    y = "train_error"
)
ctest = alt.Chart(df_scores).mark_line(color="orange").encode(
    x = "kinv",
    y = "test_error"
)
ctrain+ctest

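The blue curve is the training error, while the orange curve is the test error. Notice how underfitting occurs for very large values of K (the left side of the chart, where 1/K is small): both the training error and the test error are high. Notice also how overfitting occurs for small values of K (the right side of the chart): the training error drops toward zero while the test error rises.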
