Homework

You can download this Jupyter notebook by clicking the download arrow at the top right, and then choosing the .ipynb file.

K-Nearest Neighbors classifier

We will study the cars dataset from vega_datasets using a new model, the K-Nearest Neighbors classifier.

The K-Nearest Neighbors algorithm is the easiest algorithm in Math 10. It’s very similar to the instance-based learning from the Hands-On Machine Learning book that we read about a few weeks ago. Here is the algorithm (a from-scratch sketch follows the list).

  • Choose a number k.

  • To make a prediction for a point x, find the k closest points in the dataset, and choose the class which appears most often among those k points. (A slight variant is to assign probabilities based on those k points. For example, if k is 10 and 6 of the nearest points are in class A, 3 in class B, and 1 in class C, then we would estimate that our point has a 0.6 probability of being in class A, a 0.3 probability of being in class B, and a 0.1 probability of being in class C. If we just want to make a prediction, and not give probabilities, we would predict class A.)
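
To make this concrete, here is a minimal from-scratch sketch of the prediction step using NumPy. (The toy arrays X and y and the query point are made up for illustration; they are not variables from this notebook.)

import numpy as np
from collections import Counter

# Hypothetical toy data (not from this notebook): five points and their class labels.
X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0], [1.2, 0.5]])
y = np.array(["A", "A", "B", "B", "A"])

def knn_predict(x, X, y, k):
    distances = np.linalg.norm(X - x, axis=1)  # distance from x to every point
    nearest = np.argsort(distances)[:k]        # indices of the k closest points
    return Counter(y[nearest]).most_common(1)[0][0]  # most frequent class wins

knn_predict(np.array([1.1, 1.0]), X, y, k=3)  # the 3 nearest points are all class "A"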

Starter code

import altair as alt
import pandas as pd
from pandas.api.types import is_numeric_dtype
from vega_datasets import data
from sklearn.preprocessing import StandardScaler

We will use a very famous dataset from the field of machine learning; it contains information about cars.

source = data.cars()

We will just use three columns from this dataset.

col_a = "Weight_in_lbs"
col_b = "Miles_per_Gallon"
df = source[[col_a,col_b,"Origin"]].copy()
alt.Chart(df).mark_circle().encode(
    x = alt.X(col_a,scale=alt.Scale(zero=False)),
    y = alt.Y(col_b,scale=alt.Scale(zero=False)),
    color='Origin'
)

Question 1

Remove the rows containing at least one NaN value. (Hint: the result should contain 398 data points.)

# Your code here

Question 2

Why is this a natural dataset to use StandardScaler on?

Your answer here

Scaling the data

We scale the data using StandardScaler, temporarily storing the scaled values in df2.

scaler = StandardScaler()
df2 = df[[col_a,col_b]]
scaler.fit(df2)  # learn the mean and standard deviation of each column
df2 = scaler.transform(df2)  # note: transform returns a NumPy array, not a DataFrame
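
As an optional sanity check (not part of the homework), each scaled column should now have mean approximately 0 and standard deviation approximately 1:

# df2 is now a NumPy array; check that each column was standardized.
df2.mean(axis=0), df2.std(axis=0)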

Question 3

Put those columns back into df. Overwrite the old data, so there should still be just three columns in df. If you evaluate df.columns, you should see Index(['Weight_in_lbs', 'Miles_per_Gallon', 'Origin'], dtype='object').

# Your code here

Plotting the scaled data

If everything went correctly, the chart should look almost identical, just with different numbers along the axes.

alt.Chart(df).mark_circle().encode(
    x = alt.X(col_a,scale=alt.Scale(zero=False)),
    y = alt.Y(col_b,scale=alt.Scale(zero=False)),
    color='Origin'
)

Question 4

Instantiate a new KNeighborsClassifier (you’ll have to import it first) using n_neighbors = 1. Fit the classifier using our two numerical columns for the input X, and using the “Origin” column for the output y. Name the classifier clf.

# Your code here

Using clf to make a new prediction column

We can now make a new prediction column using that classifier.

df["pred1"] = clf.predict(df[[col_a,col_b]])

Question 5

Why does the following “prediction” chart look identical to the earlier chart that was colored by “Origin”?

alt.Chart(df).mark_circle().encode(
    x = alt.X("Weight_in_lbs",scale=alt.Scale(zero=False)),
    y = alt.Y("Miles_per_Gallon",scale=alt.Scale(zero=False)),
    color='pred1'
)

Your answer here

Question 6

Repeat the steps, starting from where we instantiated the classifier, this time using 10 neighbors, and then again using 50 neighbors. Show both plots. (Both plots should appear in your submission of this homework, so don’t delete one of them to make the next one.)

# Your code here

Question 7

Recall that a machine learning model with more variance is more prone to over-fitting, and a machine learning model with more bias is more prone to under-fitting. For example, linear regression is more on the bias side, and a high-degree polynomial regression is more on the variance side. This is what’s known as the bias-variance tradeoff. Do you think the K-Nearest Neighbors algorithm has more bias when k is a bigger number or a smaller number?

Your answer here

Example of predicting with new data

Here is an example of getting a prediction and probabilities for a car weighing 2000 pounds and getting 26 miles per gallon. You can ignore the warning that shows up; it’s just saying that our input does not include column names.

A = scaler.transform([[2000,26]])  # scale the new point with the same scaler used for the training data
A
clf.predict(A)
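
If you would rather avoid the warning entirely, one option (a sketch, not required for this homework) is to wrap the new values in a one-row DataFrame with the same column names the scaler was fit on:

new_car = pd.DataFrame([[2000, 26]], columns=[col_a, col_b])  # same columns the scaler saw
A = scaler.transform(new_car)
clf.predict(A)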

It is more interesting and meaningful to get probabilities rather than just a single prediction.

clf.predict_proba(A)

Those probabilities are listed in the same order as the classes. There is no way to know which probability corresponds to which class without evaluating clf.classes_.

clf.classes_
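
One convenient pattern (a suggestion, not something used above) is to zip the class labels together with the probabilities, so each probability is clearly labeled:

# Pair each class label with its predicted probability for the first (only) row of A.
dict(zip(clf.classes_, clf.predict_proba(A)[0]))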