K-Nearest Neighbors Regressor¶
Before the recording, we introduced the K-Nearest Neighbors Classifier and the K-Nearest Neighbors Regressor. We mentioned that larger values of K correspond to smaller variance (so overfitting is more likely to occur with smaller values of K). We also discussed the training error curve and test error curve, like the figures in Chapter 2 of Introduction to Statistical Learning.
import numpy as np
import pandas as pd
import seaborn as sns
import altair as alt
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error
df = sns.load_dataset("penguins")
#df = df.dropna()
df.dropna(inplace=True)
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 333 entries, 0 to 343
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   species            333 non-null    object
 1   island             333 non-null    object
 2   bill_length_mm     333 non-null    float64
 3   bill_depth_mm      333 non-null    float64
 4   flipper_length_mm  333 non-null    float64
 5   body_mass_g        333 non-null    float64
 6   sex                333 non-null    object
dtypes: float64(4), object(3)
memory usage: 20.8+ KB
It would be better to rescale the data first (i.e., to normalize the data). We'll talk about that soon, but we're skipping it for now.
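If we did want to rescale, one possible approach (just a sketch; we'll cover scaling properly later) uses scikit-learn's StandardScaler to replace each numeric column with its z-score:
from sklearn.preprocessing import StandardScaler

# Sketch only: standardize the three measurement columns (mean 0, std 1).
cols = ["bill_length_mm", "bill_depth_mm", "flipper_length_mm"]
df_scaled = df.copy()
df_scaled[cols] = StandardScaler().fit_transform(df[cols])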
X_train, X_test, y_train, y_test = train_test_split(
df[["bill_length_mm","bill_depth_mm","flipper_length_mm"]], df["body_mass_g"], test_size = 0.5)
X_train.shape
(166, 3)
The syntax for performing K-Nearest Neighbors regression using scikit-learn is essentially the same as the syntax for performing linear regression.
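For comparison, here is what the analogous steps might look like with LinearRegression (a sketch, not run in this notebook; it just emphasizes the shared fit/predict pattern):
from sklearn.linear_model import LinearRegression

# Same workflow, different estimator: instantiate, fit, predict.
linreg = LinearRegression()
linreg.fit(X_train, y_train)
linreg.predict(X_test)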
reg = KNeighborsRegressor(n_neighbors=10)
reg.fit(X_train, y_train)
KNeighborsRegressor(n_neighbors=10)
reg.predict(X_test)
array([4980. , 3792.5, 3600. , 5062.5, 3592.5, 4027.5, 4980. , 4012.5,
3792.5, 5522.5, 3607.5, 3380. , 5162.5, 3892.5, 5477.5, 5585. ,
5022.5, 5562.5, 3557.5, 3912.5, 3265. , 3462.5, 5067.5, 5422.5,
5092.5, 5585. , 3467.5, 4052.5, 3952.5, 3697.5, 3755. , 3647.5,
5492.5, 3395. , 5457.5, 4075. , 3590. , 3422.5, 3655. , 4520. ,
4935. , 4030. , 3222.5, 4490. , 3575. , 3455. , 3562.5, 5472.5,
4130. , 3825. , 5107.5, 4692.5, 3367.5, 5052.5, 4027.5, 3500. ,
3842.5, 4760. , 5585. , 4115. , 5512.5, 3522.5, 3667.5, 4052.5,
5625. , 4755. , 4062.5, 4030. , 4570. , 5207.5, 3225. , 3460. ,
3267.5, 3630. , 3305. , 3842.5, 5585. , 3515. , 4490. , 3447.5,
3600. , 5562.5, 3612.5, 3787.5, 4017.5, 3525. , 3727.5, 3630. ,
4705. , 4130. , 4112.5, 3480. , 4012.5, 3485. , 4992.5, 4057.5,
4857.5, 3232.5, 3675. , 3870. , 4042.5, 3635. , 3670. , 4957.5,
5427.5, 3912.5, 5235. , 4677.5, 4595. , 3780. , 5000. , 4130. ,
3725. , 3357.5, 3442.5, 5000. , 3705. , 4012.5, 3607.5, 3440. ,
5397.5, 3805. , 3467.5, 4080. , 3655. , 3367.5, 3625. , 3730. ,
4550. , 3652.5, 3812.5, 5017.5, 4115. , 4705. , 3582.5, 3710. ,
5557.5, 5305. , 5585. , 3522.5, 3607.5, 4010. , 4755. , 5585. ,
3892.5, 3575. , 3982.5, 5555. , 4475. , 3467.5, 3750. , 3562.5,
3862.5, 3892.5, 5147.5, 3707.5, 3675. , 4475. , 3672.5, 3720. ,
5625. , 5537.5, 3390. , 3655. , 3585. , 5297.5, 3805. ])
X_test.shape
(167, 3)
mean_absolute_error(reg.predict(X_test), y_test)
282.6646706586826
mean_absolute_error(reg.predict(X_train), y_train)
254.23192771084337
The above numbers are similar, with reg performing just slightly better on the training data. That suggests that for this training set, we are not overfitting the data when using K=10.
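Mean absolute error is just one possible metric. We also imported mean_squared_error above; as a quick sketch, the same train/test comparison could be made with it (squaring penalizes large errors more heavily):
# Same comparison using mean squared error instead of mean absolute error.
print(mean_squared_error(reg.predict(X_train), y_train))
print(mean_squared_error(reg.predict(X_test), y_test))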
def get_scores(k):
    # Fit a K-Nearest Neighbors regressor with the given value of k,
    # then return the pair (training error, test error) using MAE.
    reg = KNeighborsRegressor(n_neighbors=k)
    reg.fit(X_train, y_train)
    train_error = mean_absolute_error(reg.predict(X_train), y_train)
    test_error = mean_absolute_error(reg.predict(X_test), y_test)
    return (train_error, test_error)
get_scores(10)
(254.23192771084337, 282.6646706586826)
get_scores(1)
(0.0, 338.92215568862275)
df_scores = pd.DataFrame({"k":range(1,150),"train_error":np.nan,"test_error":np.nan})
df_scores
|  | k | train_error | test_error |
|---|---|---|---|
| 0 | 1 | NaN | NaN |
| 1 | 2 | NaN | NaN |
| 2 | 3 | NaN | NaN |
| 3 | 4 | NaN | NaN |
| 4 | 5 | NaN | NaN |
| ... | ... | ... | ... |
| 144 | 145 | NaN | NaN |
| 145 | 146 | NaN | NaN |
| 146 | 147 | NaN | NaN |
| 147 | 148 | NaN | NaN |
| 148 | 149 | NaN | NaN |

149 rows × 3 columns
df_scores.loc[0,["train_error","test_error"]] = get_scores(1)
df_scores.head()
|  | k | train_error | test_error |
|---|---|---|---|
| 0 | 1 | 0.0 | 338.922156 |
| 1 | 2 | NaN | NaN |
| 2 | 3 | NaN | NaN |
| 3 | 4 | NaN | NaN |
| 4 | 5 | NaN | NaN |
We often avoid using for loops in Math 10, but I couldn't find a better way to fill in this data. Let me know if you see a more Pythonic approach! (One possibility is sketched after the output below.)
# Fill in the training and test errors for each value of k.
for i in df_scores.index:
    df_scores.loc[i, ["train_error", "test_error"]] = get_scores(df_scores.loc[i, "k"])
df_scores
|  | k | train_error | test_error |
|---|---|---|---|
| 0 | 1 | 0.000000 | 338.922156 |
| 1 | 2 | 189.533133 | 290.943114 |
| 2 | 3 | 209.236948 | 284.780439 |
| 3 | 4 | 225.828313 | 280.688623 |
| 4 | 5 | 240.993976 | 281.976048 |
| ... | ... | ... | ... |
| 144 | 145 | 599.882634 | 547.684287 |
| 145 | 146 | 605.592920 | 552.209622 |
| 146 | 147 | 609.482829 | 556.251782 |
| 147 | 148 | 613.268276 | 559.163093 |
| 148 | 149 | 616.836136 | 562.464333 |

149 rows × 3 columns
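One possible more Pythonic alternative (just a sketch that builds an equivalent DataFrame, without the pre-allocated NaN columns) uses a list comprehension:
# One (train_error, test_error) pair per value of k, assembled directly.
scores = [get_scores(k) for k in range(1, 150)]
df_scores_alt = pd.DataFrame(scores, columns=["train_error", "test_error"])
df_scores_alt.insert(0, "k", range(1, 150))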
Usually when we plot a test error curve, we want higher flexibility (= higher variance) on the right. Since higher values of K correspond to lower flexibility, we are going to add a column to the DataFrame containing the reciprocals of the K values.
df_scores["kinv"] = 1/df_scores.k
ctrain = alt.Chart(df_scores).mark_line().encode(
x = "kinv",
y = "train_error"
)
ctest = alt.Chart(df_scores).mark_line(color="orange").encode(
x = "kinv",
y = "test_error"
)
The blue curve is the training error, while the orange curve is the test error. Notice how underfitting occurs for very high values of K (the left side of the chart, where 1/K is small) and overfitting occurs for very small values of K (the right side).
ctrain+ctest
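As a possible follow-up (a sketch using the df_scores DataFrame we already built), we could look up which value of K achieved the smallest test error:
# Row of df_scores where the test error is smallest.
df_scores.loc[df_scores["test_error"].idxmin()]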