Homework 6
List your name and the names of any collaborators at the top of this notebook. (Reminder: it's encouraged to work together; you can even submit the exact same homework as one or two other students, but you must list each other's names at the top.)
Introduction
This homework will be good preparation for the final project, because it is more open-ended than the typical homework. The goal is to use `KNeighborsClassifier` to investigate some aspect of the taxis dataset from Seaborn. Originally I was going to tell you specifically what columns to use, but I wasn't satisfied with my results and I think you can come up with something better.
```python
import seaborn as sns
import numpy as np
import pandas as pd
import altair as alt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import log_loss
```
```python
df = sns.load_dataset("taxis")
df.head()
```
|  | pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2019-03-23 20:21:09 | 2019-03-23 20:27:24 | 1 | 1.60 | 7.0 | 2.15 | 0.0 | 12.95 | yellow | credit card | Lenox Hill West | UN/Turtle Bay South | Manhattan | Manhattan |
| 1 | 2019-03-04 16:11:55 | 2019-03-04 16:19:00 | 1 | 0.79 | 5.0 | 0.00 | 0.0 | 9.30 | yellow | cash | Upper West Side South | Upper West Side South | Manhattan | Manhattan |
| 2 | 2019-03-27 17:53:01 | 2019-03-27 18:00:25 | 1 | 1.37 | 7.5 | 2.36 | 0.0 | 14.16 | yellow | credit card | Alphabet City | West Village | Manhattan | Manhattan |
| 3 | 2019-03-10 01:23:59 | 2019-03-10 01:49:51 | 1 | 7.70 | 27.0 | 6.15 | 0.0 | 36.95 | yellow | credit card | Hudson Sq | Yorkville West | Manhattan | Manhattan |
| 4 | 2019-03-30 13:27:42 | 2019-03-30 13:37:14 | 3 | 2.16 | 9.0 | 1.10 | 0.0 | 13.40 | yellow | credit card | Midtown East | Yorkville West | Manhattan | Manhattan |
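Before posing a question, it can be worth a quick look at the column data types and missing values. (This check is optional and not part of the required points below. Depending on your seaborn version, the `pickup` and `dropoff` columns may load as plain strings rather than datetimes, and a handful of rows may have missing payment or zone information.)

```python
# Optional sanity check: what data type does each column have,
# and how many values are missing in each column?
print(df.dtypes)
print(df.isna().sum())
```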
Assignment
Pose a question related to the taxis dataset loaded above, and investigate that question using `KNeighborsClassifier`. For example, if we were instead working with the penguins dataset, the question might be something like, "Can we use flipper length and bill length to predict the species of penguin?" Make sure you're posing a classification problem and not a regression problem.
Address the following points. (A rough end-to-end sketch illustrating these steps appears after the list.)

- State explicitly what question you are investigating. (It doesn't need to be a question with a definitive answer.)
- Convert at least one of the `pickup` and `dropoff` columns into a `datetime` data type, and use some aspect of that column in your analysis. (For example, you could use `.dt.hour` or `.dt.day_name()`. Note that `hour` is an attribute, so it does not include parentheses, while `day_name()` is a method, so it does.)
- Include at least one Boolean column in your `X` data. (There aren't any Boolean columns in this dataset, so you will have to produce one. Producing new columns like this is called feature engineering. For example, with the penguins dataset, we could create a Boolean column indicating whether the bill length is over 5cm.)
- For the numerical (or Boolean) columns that you use in your `X` data, rescale them using `StandardScaler` and use the scaled versions when fitting (and predicting) with `KNeighborsClassifier`. (Actually, every column fed to the `X` portion of the `KNeighborsClassifier` should be either numerical or Boolean; it does not accept categorical values in the `X`. If you want to use a categorical value in the `X`, you need to convert it somehow into a numerical or Boolean value.)
- Use `train_test_split` to attempt to detect over-fitting or under-fitting. Evaluate the performance of your classifier using the `log_loss` metric that was imported above.
- Make a plot in Altair related to your question. (It's okay if the plot is just loosely related to your question. For example, if you are using many different columns, it would be difficult to show all of that information in one plot.) This dataset is about 6,000 rows long, which is too long for Altair by default, but you can disable that limit using `alt.data_transformers.disable_max_rows()`. (Disabling the limit would be a bad idea for a huge dataset, but with this dataset it should be fine.)
- State a specific value of `k` for which this `KNeighborsClassifier` seems to perform best, meaning the `log_loss` error for the test set is lowest. (It's okay if your `k` is just an estimate.) For example, if you look at the test error curve at the bottom of the notebook from Wednesday Week 6, you'll see that for that problem, the regressor performed best when 1/k was between 0.1 and 0.2, so when `k` was between 5 and 10. (If you find that the performance is best with the biggest possible `k`, that probably means that `KNeighborsClassifier` is not an effective tool for your specific choice of `X` data and `y` data. That's okay, but it would be even better if you could make some adjustment.)
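To make these points concrete, here is a minimal sketch of one possible workflow. The question it investigates (can trip distance, pickup hour, and whether a trip crosses borough lines help predict the payment type?), the choice of columns, the 80/20 split, and the range of `k` values are all illustrative assumptions, not a required or recommended solution; you should pose your own question and choose your own features.

```python
# A rough sketch of one possible workflow, not the required solution.
# Illustrative question: can distance, pickup hour, and whether a trip
# crosses borough lines help predict the payment type (cash vs. credit card)?

import altair as alt
import pandas as pd
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import log_loss

df = sns.load_dataset("taxis").dropna()

# Datetime step: convert pickup (harmless if it is already a datetime)
# and extract the hour of day as a numerical feature.
df["pickup"] = pd.to_datetime(df["pickup"])
df["hour"] = df["pickup"].dt.hour

# Feature engineering step: a Boolean column for cross-borough trips.
df["cross_borough"] = df["pickup_borough"] != df["dropoff_borough"]

# Rescale the numerical/Boolean feature columns with StandardScaler.
features = ["distance", "hour", "cross_borough"]
X = pd.DataFrame(
    StandardScaler().fit_transform(df[features]),
    columns=features,
    index=df.index,
)
y = df["payment"]

# Hold out a test set so we can look for over- or under-fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit KNeighborsClassifier for a range of k values and record the log_loss
# on both the training set and the test set.
rows = []
for k in range(1, 100, 5):
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(X_train, y_train)
    rows.append({
        "k": k,
        "train_loss": log_loss(y_train, clf.predict_proba(X_train), labels=clf.classes_),
        "test_loss": log_loss(y_test, clf.predict_proba(X_test), labels=clf.classes_),
    })
scores = pd.DataFrame(rows)

# Altair charts: one loosely related to the question (charting the full
# ~6,000-row dataset needs the row limit disabled) and one test-error curve
# that helps with choosing k.
alt.data_transformers.disable_max_rows()

scatter = alt.Chart(df).mark_circle(opacity=0.3).encode(
    x="hour", y="distance", color="payment"
)
error_curve = alt.Chart(scores).mark_line(point=True).encode(
    x="k", y="test_loss", tooltip=["k", "train_loss", "test_loss"]
)
scatter & error_curve
```

If the test-error curve is still decreasing at the largest `k` you try, that is the warning sign described in the last point above, and you may want to adjust your choice of features or question.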
Submission
Download the .ipynb file for this notebook (click on the folder icon to the left, then the … next to the file name) and upload the file on Canvas.