# K-Nearest Neighbors Regressor

[YuJa recording](https://uci.yuja.com/V/Video?v=4348961&node=14654381&a=1301700135&autoplay=1)

Before the recording, we introduced the K-Nearest Neighbors Classifier and the K-Nearest Neighbors Regressor. We mentioned that larger K corresponds to smaller variance (so overfitting is more likely to occur with smaller values of K). We also discussed the training error curve and the test error curve, such as those shown in the figures in Chapter 2 of *Introduction to Statistical Learning*.
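To make the idea concrete, here is a minimal NumPy sketch of what a K-Nearest Neighbors regressor computes: the prediction for a new point is the average target value of its K closest training points. It assumes Euclidean distance and equal weights for all neighbors; the function name `knn_predict` and the toy data are made up for illustration and are not part of the notebook.

```python
import numpy as np

def knn_predict(X_train, y_train, X_new, k):
    """Predict each row of X_new as the mean target of its k nearest training rows."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    X_new = np.asarray(X_new, dtype=float)
    # Pairwise Euclidean distances, shape (len(X_new), len(X_train))
    dists = np.linalg.norm(X_new[:, None, :] - X_train[None, :, :], axis=2)
    # Indices of the k closest training rows for each new row
    nearest = np.argsort(dists, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)

# Tiny made-up data set with a single input feature
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [10.0]])
y_toy = np.array([0.0, 1.0, 2.0, 3.0, 10.0])

# K = 1: each training point is its own nearest neighbor (very flexible)
pred_k1 = knn_predict(X_toy, y_toy, X_toy, k=1)
# K = 5: all five points are used, so every prediction is the overall mean (very rigid)
pred_k5 = knn_predict(X_toy, y_toy, X_toy, k=5)
```

With K = 1 the predictions on the training data reproduce the training targets exactly, which is the extreme case of high variance. With K equal to the number of training points, every prediction is the same overall average.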

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import altair as alt
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error

In [2]:
df = sns.load_dataset("penguins")
#df = df.dropna()
df.dropna(inplace=True)

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 333 entries, 0 to 343
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            333 non-null    object 
 1   island             333 non-null    object 
 2   bill_length_mm     333 non-null    float64
 3   bill_depth_mm      333 non-null    float64
 4   flipper_length_mm  333 non-null    float64
 5   body_mass_g        333 non-null    float64
 6   sex                333 non-null    object 
dtypes: float64(4), object(3)
memory usage: 20.8+ KB


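It would be better to rescale (i.e., normalize) the data first, because K-Nearest Neighbors relies on distances between points and these columns are measured on different scales. We'll talk about that soon, but we're skipping it for now.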
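As a rough illustration of why the scale matters, the following sketch uses made-up numbers (not the penguins data) with one column in millimeters and one in grams. It standardizes each column to z-scores, which is one common rescaling choice and not necessarily the one we will use later. The helper names `standardize` and `nearest_other_row` are hypothetical.

```python
import numpy as np

# Made-up measurements: column 0 in millimeters, column 1 in grams
X = np.array([[40.0, 3000.0],
              [60.0, 3010.0],
              [40.5, 3100.0],
              [50.0, 4000.0]])

def standardize(X_fit, X_apply):
    """Rescale the columns of X_apply using the column means and stds of X_fit."""
    mean = X_fit.mean(axis=0)
    std = X_fit.std(axis=0)
    return (X_apply - mean) / std

def nearest_other_row(X, i):
    """Index of the row closest to row i in Euclidean distance, excluding row i."""
    dists = np.linalg.norm(X - X[i], axis=1)
    dists[i] = np.inf
    return int(np.argmin(dists))

# In raw units, the grams column dominates the distance
raw_neighbor = nearest_other_row(X, 0)
# After standardizing, both columns contribute on a comparable scale
scaled_neighbor = nearest_other_row(standardize(X, X), 0)
```

In raw units the nearest neighbor of row 0 is row 1, even though the two differ by 20 mm, because their masses differ by only 10 g. After rescaling, the nearest neighbor becomes row 2, which is close in both columns. In practice the means and standard deviations would be computed from the training set only and then applied to the test set.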

In [4]:
# No random_state is passed, so the split (and the numbers below) will change from run to run.
X_train, X_test, y_train, y_test = train_test_split(
    df[["bill_length_mm", "bill_depth_mm", "flipper_length_mm"]], df["body_mass_g"], test_size=0.5)

In [5]:
X_train.shape

(166, 3)

The syntax for performing K-Nearest Neighbors regression using scikit-learn is essentially the same as the syntax for performing linear regression.
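To make the comparison concrete, here is a small sketch on made-up data (not the penguins data) in which a `LinearRegression` object and a `KNeighborsRegressor` object are used through exactly the same `fit` and `predict` calls. The data and the dictionary `preds` are assumptions introduced only for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

# Made-up data lying exactly on the line y = 2x + 1
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2.0 * X[:, 0] + 1.0

preds = {}
for model in (LinearRegression(), KNeighborsRegressor(n_neighbors=2)):
    # Identical calls, whichever kind of model is used
    model.fit(X, y)
    preds[type(model).__name__] = model.predict(np.array([[4.5]]))
```

Only the line that creates the model changes. Here both models predict 10 at x = 4.5: the linear model from the fitted line, and the K-Nearest Neighbors model by averaging the targets 9 and 11 of the two closest training points.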

In [6]:
reg = KNeighborsRegressor(n_neighbors=10)

In [7]:
reg.fit(X_train, y_train)

KNeighborsRegressor(n_neighbors=10)

In [8]:
reg.predict(X_test)

array([4652.5, 5000. , 3460. , 3835. , 3347.5, 4785. , 4757.5, 5505. ,
       4072.5, 3572.5, 4582.5, 4490. , 3690. , 3295. , 5530. , 3560. ,
       3605. , 3600. , 4525. , 3395. , 5490. , 3600. , 3475. , 3540. ,
       4460. , 5580. , 5322.5, 3520. , 3927.5, 4987.5, 4862.5, 5257.5,
       3632.5, 4200. , 3987.5, 5580. , 3915. , 3660. , 3335. , 3720. ,
       4812.5, 3912.5, 4862.5, 3342.5, 5580. , 3517.5, 4040. , 5450. ,
       3605. , 3347.5, 3927.5, 4020. , 4982.5, 3865. , 4060. , 4067.5,
       3295. , 3865. , 5440. , 3435. , 3607.5, 3712.5, 3522.5, 3517.5,
       3830. , 3720. , 4835. , 3477.5, 4795. , 3960. , 3637.5, 4995. ,
       4640. , 5460. , 4010. , 3777.5, 3582.5, 3565. , 4465. , 3705. ,
       3622.5, 3767.5, 3517.5, 3742.5, 3347.5, 5530. , 4550. , 3875. ,
       4952.5, 3837.5, 3530. , 5630. , 4907.5, 3670. , 5595. , 5580. ,
       3892.5, 4360. , 4507.5, 5132.5, 3830. , 4622.5, 3912.5, 3295. ,
       3840. , 3555. , 3745. , 4987.5, 3690. , 5147.5, 3547.5, 3595. ,
      

In [9]:
X_test.shape

(167, 3)

In [10]:
mean_absolute_error(y_test, reg.predict(X_test))

272.45508982035926

In [11]:
mean_absolute_error(y_train, reg.predict(X_train))

261.0843373493976

The above numbers are similar, with `reg` performing just slightly better on the training data.  That suggests that for this training set, we are not overfitting the data when using K=10.
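For reference, the mean absolute error is the average of the absolute differences between the true values and the predictions, so here it is measured in grams. A minimal sketch of the computation, using made-up numbers and a hypothetical helper named `mae`:

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error: the average of |true value - prediction|."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.abs(y_true - y_pred).mean()

# Made-up body masses (grams) and predictions
y_true = [3500.0, 4200.0, 5000.0]
y_pred = [3600.0, 4000.0, 5000.0]

err = mae(y_true, y_pred)  # (100 + 200 + 0) / 3
```

Because of the absolute value, swapping the two arguments does not change the result.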

In [12]:
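def get_scores(k):
    """Return (train_error, test_error), measured by MAE, for K-Nearest Neighbors with K = k."""
    reg = KNeighborsRegressor(n_neighbors=k)
    reg.fit(X_train, y_train)
    # scikit-learn's metric functions take the true values first, then the predictions
    train_error = mean_absolute_error(y_train, reg.predict(X_train))
    test_error = mean_absolute_error(y_test, reg.predict(X_test))
    return (train_error, test_error)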

In [13]:
get_scores(10)

(261.0843373493976, 272.45508982035926)

In [14]:
get_scores(1)

(0.0, 372.90419161676647)

In [15]:
df_scores = pd.DataFrame({"k":range(1,150),"train_error":np.nan,"test_error":np.nan})

In [16]:
df_scores

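|  | k | train_error | test_error |
|---|---|---|---|
| 0 | 1 | NaN | NaN |
| 1 | 2 | NaN | NaN |
| 2 | 3 | NaN | NaN |
| 3 | 4 | NaN | NaN |
| 4 | 5 | NaN | NaN |
| ... | ... | ... | ... |
| 144 | 145 | NaN | NaN |
| 145 | 146 | NaN | NaN |
| 146 | 147 | NaN | NaN |
| 147 | 148 | NaN | NaN |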


In [17]:
df_scores.loc[0,["train_error","test_error"]] = get_scores(1)

In [18]:
df_scores.head()

|  | k | train_error | test_error |
|---|---|---|---|
| 0 | 1 | 0.0 | 372.904192 |
| 1 | 2 | NaN | NaN |
| 2 | 3 | NaN | NaN |
| 3 | 4 | NaN | NaN |
| 4 | 5 | NaN | NaN |


We often avoid using `for` loops in Math 10, but I couldn't find a better way to fill in this data.  Let me know if you see a more Pythonic approach!

In [24]:
for i in df_scores.index:
    df_scores.loc[i,["train_error","test_error"]] = get_scores(df_scores.loc[i,"k"])
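One possible alternative to the loop above is to collect all of the score pairs in a list comprehension and build the DataFrame in a single step. This is only a sketch: `get_scores_demo` is a hypothetical stand-in that returns made-up numbers so the example runs on its own, and in the notebook the real `get_scores` would be called instead.

```python
import pandas as pd

def get_scores_demo(k):
    # Hypothetical stand-in for get_scores; returns made-up (train_error, test_error) values
    return (float(k), 2.0 * k)

ks = range(1, 6)
# One call per k; each call returns a (train_error, test_error) pair
scores = [get_scores_demo(k) for k in ks]
# Each pair becomes one row with two columns
df_alt = pd.DataFrame(scores, columns=["train_error", "test_error"])
df_alt.insert(0, "k", list(ks))
```

This still loops over the values of K, since a separate model has to be fit for each one, but it avoids assigning into the DataFrame one row at a time with `loc`.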

In [25]:
df_scores

|  | k | train_error | test_error | kinv |
|---|---|---|---|---|
| 0 | 1 | 0.000000 | 372.904192 | 1.000000 |
| 1 | 2 | 177.484940 | 306.961078 | 0.500000 |
| 2 | 3 | 208.935743 | 293.562874 | 0.333333 |
| 3 | 4 | 228.765060 | 286.190120 | 0.250000 |
| 4 | 5 | 231.987952 | 279.550898 | 0.200000 |
| ... | ... | ... | ... | ... |
| 144 | 145 | 527.361861 | 615.610159 | 0.006897 |
| 145 | 146 | 531.186046 | 620.454639 | 0.006849 |
| 146 | 147 | 535.337267 | 624.538678 | 0.006803 |
| 147 | 148 | 540.600578 | 629.153180 | 0.006757 |


Usually when we plot a test error curve, we want higher flexibility (= higher variance) on the right. Since higher values of K correspond to lower flexibility, we are going to add a column to the DataFrame containing the reciprocals of the K values. (The `kinv` column already appears in the output above because these cells were re-run after the column had been created.)

In [26]:
df_scores["kinv"] = 1/df_scores.k

In [27]:
ctrain = alt.Chart(df_scores).mark_line().encode(
    x = "kinv",
    y = "train_error"
)

In [28]:
ctest = alt.Chart(df_scores).mark_line(color="orange").encode(
    x = "kinv",
    y = "test_error"
)

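The blue curve is the training error, while the orange curve is the test error. Notice how underfitting occurs for very high values of K (the left side of the chart, where both errors are large) and how overfitting occurs for small values of K (the right side of the chart, where the training error drops to zero while the test error rises).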

In [30]:
ctrain+ctest
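To read the same information off numerically rather than from the chart, one option is to look for the row with the smallest test error. The sketch below uses only the rows that are printed in these notes (K = 1 to 5, K = 10, and K = 145 to 148), so it is not the full table, and the names `df_shown`, `best_k`, and `gap` are introduced just for this example.

```python
import pandas as pd

# Only the rows printed in these notes, not the full table of K values
df_shown = pd.DataFrame({
    "k": [1, 2, 3, 4, 5, 10, 145, 146, 147, 148],
    "train_error": [0.0, 177.484940, 208.935743, 228.765060, 231.987952,
                    261.084337, 527.361861, 531.186046, 535.337267, 540.600578],
    "test_error": [372.904192, 306.961078, 293.562874, 286.190120, 279.550898,
                   272.455090, 615.610159, 620.454639, 624.538678, 629.153180],
})

# Row with the smallest test error among the rows shown
best_row = df_shown.loc[df_shown["test_error"].idxmin()]
best_k = int(best_row["k"])

# Gap between test and training error; a large gap is a sign of overfitting
gap = df_shown["test_error"] - df_shown["train_error"]
```

Among these rows, the test error is smallest at K = 10, and the gap between test error and training error is largest at K = 1. On the full `df_scores` DataFrame the same `idxmin` call could be used, though the best K will depend on the random train/test split.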