Week 7 Friday#
Announcements#
One of the upcoming worksheets will be on the MNIST dataset of handwritten digits, which Jinghao introduced yesterday.
Midterm 2 is two weeks from today. It will be similar to Midterm 1 in format (you can use a notecard with handwritten notes on it; otherwise it is closed book and closed computer) and similar in length.
We don’t have a final exam in this class, but there will be a “course project” due during finals week.
Warm-up: Most common values#
import seaborn as sns
import altair as alt
import pandas as pd
df_pre = sns.load_dataset("taxis")
Here is a reminder of how the dataset looks.
df_pre
|      | pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough |
|------|--------|---------|------------|----------|------|-----|-------|-------|-------|---------|-------------|--------------|----------------|-----------------|
| 0 | 2019-03-23 20:21:09 | 2019-03-23 20:27:24 | 1 | 1.60 | 7.0 | 2.15 | 0.0 | 12.95 | yellow | credit card | Lenox Hill West | UN/Turtle Bay South | Manhattan | Manhattan |
| 1 | 2019-03-04 16:11:55 | 2019-03-04 16:19:00 | 1 | 0.79 | 5.0 | 0.00 | 0.0 | 9.30 | yellow | cash | Upper West Side South | Upper West Side South | Manhattan | Manhattan |
| 2 | 2019-03-27 17:53:01 | 2019-03-27 18:00:25 | 1 | 1.37 | 7.5 | 2.36 | 0.0 | 14.16 | yellow | credit card | Alphabet City | West Village | Manhattan | Manhattan |
| 3 | 2019-03-10 01:23:59 | 2019-03-10 01:49:51 | 1 | 7.70 | 27.0 | 6.15 | 0.0 | 36.95 | yellow | credit card | Hudson Sq | Yorkville West | Manhattan | Manhattan |
| 4 | 2019-03-30 13:27:42 | 2019-03-30 13:37:14 | 3 | 2.16 | 9.0 | 1.10 | 0.0 | 13.40 | yellow | credit card | Midtown East | Yorkville West | Manhattan | Manhattan |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6428 | 2019-03-31 09:51:53 | 2019-03-31 09:55:27 | 1 | 0.75 | 4.5 | 1.06 | 0.0 | 6.36 | green | credit card | East Harlem North | Central Harlem North | Manhattan | Manhattan |
| 6429 | 2019-03-31 17:38:00 | 2019-03-31 18:34:23 | 1 | 18.74 | 58.0 | 0.00 | 0.0 | 58.80 | green | credit card | Jamaica | East Concourse/Concourse Village | Queens | Bronx |
| 6430 | 2019-03-23 22:55:18 | 2019-03-23 23:14:25 | 1 | 4.14 | 16.0 | 0.00 | 0.0 | 17.30 | green | cash | Crown Heights North | Bushwick North | Brooklyn | Brooklyn |
| 6431 | 2019-03-04 10:09:25 | 2019-03-04 10:14:29 | 1 | 1.12 | 6.0 | 0.00 | 0.0 | 6.80 | green | credit card | East New York | East Flatbush/Remsen Village | Brooklyn | Brooklyn |
| 6432 | 2019-03-13 19:31:22 | 2019-03-13 19:48:02 | 1 | 3.85 | 15.0 | 3.36 | 0.0 | 20.16 | green | credit card | Boerum Hill | Windsor Terrace | Brooklyn | Brooklyn |
6433 rows Ă— 14 columns
I don’t think it’s realistic to perform logistic regression directly on the “pickup_zone” column, because there are so many values. Here are those values.
df_pre["pickup_zone"].unique()
array(['Lenox Hill West', 'Upper West Side South', 'Alphabet City',
'Hudson Sq', 'Midtown East', 'Times Sq/Theatre District',
'Battery Park City', 'Murray Hill', 'East Harlem South',
'Lincoln Square East', 'LaGuardia Airport', 'Lincoln Square West',
'Financial District North', 'Upper West Side North',
'East Chelsea', 'Midtown Center', 'Gramercy',
'Penn Station/Madison Sq West', 'Sutton Place/Turtle Bay North',
'West Chelsea/Hudson Yards', 'Clinton East', 'Clinton West',
'UN/Turtle Bay South', 'Midtown South', 'Midtown North',
'Garment District', 'Lenox Hill East', 'Flatiron',
'TriBeCa/Civic Center', nan, 'Upper East Side North',
'West Village', 'Greenwich Village South', 'JFK Airport',
'East Village', 'Union Sq', 'Yorkville West', 'Central Park',
'Meatpacking/West Village West', 'Kips Bay', 'Morningside Heights',
'Astoria', 'East Tremont', 'Upper East Side South',
'Financial District South', 'Bloomingdale', 'Queensboro Hill',
'SoHo', 'Brooklyn Heights', 'Yorkville East', 'Manhattan Valley',
'DUMBO/Vinegar Hill', 'Little Italy/NoLiTa',
'Mott Haven/Port Morris', 'Greenwich Village North',
'Stuyvesant Heights', 'Lower East Side', 'East Harlem North',
'Chinatown', 'Fort Greene', 'Steinway', 'Central Harlem',
'Crown Heights North', 'Seaport', 'Two Bridges/Seward Park',
'Boerum Hill', 'Williamsburg (South Side)', 'Rosedale', 'Flushing',
'Old Astoria', 'Soundview/Castle Hill',
'Stuy Town/Peter Cooper Village', 'World Trade Center',
'Sunnyside', 'Washington Heights South', 'Prospect Heights',
'East New York', 'Hamilton Heights', 'Cobble Hill',
'Long Island City/Queens Plaza', 'Central Harlem North',
'Manhattanville', 'East Flatbush/Farragut', 'Elmhurst',
'East Concourse/Concourse Village', 'Park Slope', 'Greenpoint',
'Williamsburg (North Side)', 'Long Island City/Hunters Point',
'South Ozone Park', 'Ridgewood', 'Downtown Brooklyn/MetroTech',
'Queensbridge/Ravenswood', 'Williamsbridge/Olinville', 'Bedford',
'Gowanus', 'Jackson Heights', 'South Jamaica', 'Bushwick North',
'West Concourse', 'Queens Village', 'Windsor Terrace', 'Flatlands',
'Van Cortlandt Village', 'Woodside', 'East Williamsburg',
'Fordham South', 'East Elmhurst', 'Kew Gardens',
'Flushing Meadows-Corona Park', 'Marine Park/Mill Basin',
'Carroll Gardens', 'Canarsie', 'East Flatbush/Remsen Village',
'Jamaica', 'Marble Hill', 'Bushwick South', 'Erasmus',
'Claremont/Bathgate', 'Pelham Bay', 'Soundview/Bruckner',
'South Williamsburg', 'Battery Park', 'Forest Hills', 'Maspeth',
'Bronx Park', 'Starrett City', 'Brighton Beach', 'Brownsville',
'Highbridge Park', 'Bensonhurst East', 'Mount Hope',
'Prospect-Lefferts Gardens', 'Bayside', 'Douglaston', 'Midwood',
'North Corona', 'Homecrest', 'Westchester Village/Unionport',
'University Heights/Morris Heights', 'Inwood',
'Washington Heights North', 'Flatbush/Ditmas Park', 'Rego Park',
'Riverdale/North Riverdale/Fieldston', 'Jamaica Estates',
'Borough Park', 'Sunset Park West', 'Belmont', 'Auburndale',
'Schuylerville/Edgewater Park', 'Co-Op City',
'Crown Heights South', 'Spuyten Duyvil/Kingsbridge',
'Morrisania/Melrose', 'Hollis', 'Parkchester', 'Coney Island',
'East Flushing', 'Richmond Hill', 'Bedford Park', 'Highbridge',
'Clinton Hill', 'Sheepshead Bay', 'Madison', 'Dyker Heights',
'Cambria Heights', 'Pelham Parkway', 'Hunts Point',
'Melrose South', 'Springfield Gardens North', 'Bay Ridge',
'Elmhurst/Maspeth', 'Crotona Park East', 'Bronxdale',
'Briarwood/Jamaica Hills', 'Van Nest/Morris Park',
'Murray Hill-Queens', 'Kingsbridge Heights', 'Whitestone',
'Saint Albans', 'Allerton/Pelham Gardens', 'Howard Beach',
'Norwood', 'Bensonhurst West', 'Columbia Street', 'Middle Village',
'Prospect Park', 'Ozone Park', 'Gravesend', 'Glendale',
'Kew Gardens Hills', 'Woodlawn/Wakefield',
'West Farms/Bronx River', 'Hillcrest/Pomonok'], dtype=object)
In total, there are 195 unique values in the “pickup_zone” column (one of which is the missing value `nan`).
len(df_pre["pickup_zone"].unique())
195
In the taxis dataset from Seaborn, keep only the rows with a pickup zone that occurs at least 200 times in the dataset. Store the resulting DataFrame as `df`.
This is an example of working with a pandas Series. Most of our examples of pandas Series are columns in a DataFrame, but in this case the `value_counts` method also returns a pandas Series.
vc = df_pre["pickup_zone"].value_counts()
This particular Series has some nice properties. Its index lists the pickup zones in order from most frequent to least frequent, and its values record how often each zone occurs. (Note that `value_counts` skips `nan`, which is why this Series has length 194 rather than 195.)
vc
Midtown Center 230
Upper East Side South 211
Penn Station/Madison Sq West 210
Clinton East 208
Midtown East 198
...
Riverdale/North Riverdale/Fieldston 1
Ozone Park 1
Hollis 1
Auburndale 1
Homecrest 1
Name: pickup_zone, Length: 194, dtype: int64
Here is a Boolean Series indicating which pickup zones occur at least 200 times.
vc >= 200
Midtown Center True
Upper East Side South True
Penn Station/Madison Sq West True
Clinton East True
Midtown East False
...
Riverdale/North Riverdale/Fieldston False
Ozone Park False
Hollis False
Auburndale False
Homecrest False
Name: pickup_zone, Length: 194, dtype: bool
Here we perform Boolean indexing to keep only those entries where the value is at least 200.
vc[vc >= 200]
Midtown Center 230
Upper East Side South 211
Penn Station/Madison Sq West 210
Clinton East 208
Name: pickup_zone, dtype: int64
If we only care about the index, then we can use the following. Read this as, “Keep only those index terms for which the value is at least 200.”
vc.index[vc >= 200]
Index(['Midtown Center', 'Upper East Side South',
'Penn Station/Madison Sq West', 'Clinton East'],
dtype='object')
Here is an alternative approach. Read this as, “Keep only those Series entries for which the value is at least 200, and then extract the index from that Series.”
vc[vc >= 200].index
Index(['Midtown Center', 'Upper East Side South',
'Penn Station/Madison Sq West', 'Clinton East'],
dtype='object')
Aside: I didn’t realize that Series also had a `keys` method, but here we see that they do. This was suggested by a student during class.
# I didn't know this worked
vc.keys()
Index(['Midtown Center', 'Upper East Side South',
'Penn Station/Madison Sq West', 'Clinton East', 'Midtown East',
'Upper East Side North', 'Times Sq/Theatre District', 'Union Sq',
'Lincoln Square East', 'Murray Hill',
...
'Bedford Park', 'Whitestone', 'Crotona Park East', 'Queens Village',
'Columbia Street', 'Riverdale/North Riverdale/Fieldston', 'Ozone Park',
'Hollis', 'Auburndale', 'Homecrest'],
dtype='object', length=194)
Let’s store the above pandas Index (the zones corresponding to at least 200 rows) with the variable name `pz200` (short for “pickup zone 200”).
pz200 = vc[vc >= 200].index
pz200
Index(['Midtown Center', 'Upper East Side South',
'Penn Station/Madison Sq West', 'Clinton East'],
dtype='object')
We can compare a Series against any list-like object (like the above `pz200`, which is a pandas Index, not a list) by passing that object as an argument to the `isin` method, which returns a Boolean Series. For example, the following shows which of the pickup zones are in our `pz200` variable.
df_pre["pickup_zone"].isin(pz200)
0 False
1 False
2 False
3 False
4 False
...
6428 False
6429 False
6430 False
6431 False
6432 False
Name: pickup_zone, Length: 6433, dtype: bool
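As a quick sanity check (a sketch, not something we ran in class), summing this Boolean Series counts its `True` entries, since `True` counts as 1:

# number of rows whose pickup zone is one of the four frequent zones
df_pre["pickup_zone"].isin(pz200).sum()  # 859, matching the number of rows kept below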
We can use Boolean indexing with this `isin` method to get only the rows for which the pickup zone is one of the entries that occurs at least 200 times.

If you look over the values of “pickup_zone” in `df`, you’ll find that the only values that occur are the four from our `pz200` variable.
df = df_pre[df_pre["pickup_zone"].isin(pz200)]
df
|      | pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough |
|------|--------|---------|------------|----------|------|-----|-------|-------|-------|---------|-------------|--------------|----------------|-----------------|
| 17 | 2019-03-23 20:50:49 | 2019-03-23 21:02:07 | 1 | 2.60 | 10.5 | 2.00 | 0.0 | 16.30 | yellow | credit card | Midtown Center | East Harlem South | Manhattan | Manhattan |
| 20 | 2019-03-21 03:37:34 | 2019-03-21 03:44:13 | 1 | 1.07 | 6.5 | 1.54 | 0.0 | 11.84 | yellow | credit card | Penn Station/Madison Sq West | Kips Bay | Manhattan | Manhattan |
| 21 | 2019-03-25 23:05:54 | 2019-03-25 23:11:13 | 1 | 0.80 | 5.5 | 2.30 | 0.0 | 11.60 | yellow | credit card | Penn Station/Madison Sq West | Murray Hill | Manhattan | Manhattan |
| 27 | 2019-03-16 20:30:36 | 2019-03-16 20:46:22 | 1 | 2.60 | 12.5 | 3.26 | 0.0 | 19.56 | yellow | credit card | Clinton East | Lenox Hill West | Manhattan | Manhattan |
| 31 | 2019-03-01 02:55:55 | 2019-03-01 02:57:59 | 3 | 0.74 | 4.0 | 0.00 | 0.0 | 7.80 | yellow | cash | Clinton East | West Chelsea/Hudson Yards | Manhattan | Manhattan |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5403 | 2019-03-15 08:42:56 | 2019-03-15 08:47:27 | 1 | 0.64 | 5.0 | 1.00 | 0.0 | 9.30 | yellow | credit card | Upper East Side South | Lenox Hill East | Manhattan | Manhattan |
| 5404 | 2019-03-20 21:03:10 | 2019-03-20 21:11:12 | 1 | 1.79 | 8.5 | 0.00 | 0.0 | 12.30 | yellow | cash | Midtown Center | Upper East Side North | Manhattan | Manhattan |
| 5423 | 2019-03-22 13:47:56 | 2019-03-22 14:01:32 | 1 | 1.00 | 9.5 | 2.00 | 0.0 | 14.80 | yellow | credit card | Midtown Center | Union Sq | Manhattan | Manhattan |
| 5444 | 2019-03-05 21:39:03 | 2019-03-05 21:49:12 | 5 | 1.31 | 8.0 | 0.00 | 0.0 | 11.80 | yellow | cash | Clinton East | Midtown East | Manhattan | Manhattan |
| 5445 | 2019-03-13 10:57:06 | 2019-03-13 11:03:29 | 1 | 0.83 | 6.0 | 1.86 | 0.0 | 11.16 | yellow | credit card | Upper East Side South | Upper East Side North | Manhattan | Manhattan |
859 rows Ă— 14 columns
How many rows are in `df`?
I had originally planned to relate this to `vc` above, but I skipped that to save time. It would be a good exercise to see whether you can recover the following 859 number using only the pandas Series `vc`; one possible approach is sketched after the next cell.
len(df)
859
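Here is one way (a sketch, not something we ran in class): the entries of `vc` that are at least 200 are exactly the counts of the four zones kept in `df`, so their sum matches `len(df)`.

# recover the row count of df from the value_counts Series alone
vc[vc >= 200].sum()  # 230 + 211 + 210 + 208 = 859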
Draw a scatter plot in Altair, encoding “distance” in the x-channel, “total” in the y-channel, and “pickup_zone” in the color channel.
Let’s first look at `df_pre` instead of `df`. This does not seem like a good candidate for Machine Learning to me, because it has so many classes (in fact, it has 194 distinct pickup zones).
# original data; we sample 5000 rows because Altair's default limit is 5000 rows
alt.Chart(df_pre.sample(5000)).mark_circle().encode(
x="distance",
y="total",
color="pickup_zone"
)
It’s a little less chaotic when we switch from `df_pre` to `df`. Now there are only four pickup zones.

On the other hand, it still does not seem like a good candidate for predictions, because there are no clear patterns in this data.
# only the top pickup zones
alt.Chart(df).mark_circle().encode(
x="distance",
y="total",
color="pickup_zone"
)
Do you expect to be able to predict the pickup zone from this data?
No, there’s no clear pattern to the pickup zones in terms of distance and total fare.
Ten minutes to work on Worksheets#
Work on the worksheets due Monday. Yufei, Hanson, and I are here to answer questions.
If you’re already done:
Fit a logistic regression classifier to the above taxis data, using “distance” and “total” as your input features, and using “pickup_zone” as your target.
When I tried this, I got 33% accuracy. This is a little better than random guessing, but not much.
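Here is a minimal sketch of what that could look like (assumed code, not the exact cell from class; `clf_log` is a name I'm introducing here, and I'm raising `max_iter` because the default can fail to converge on this data):

from sklearn.linear_model import LogisticRegression

# input features and target, as described in the exercise
cols = ["distance", "total"]
clf_log = LogisticRegression(max_iter=1000)
clf_log.fit(df[cols], df["pickup_zone"])
clf_log.score(df[cols], df["pickup_zone"])  # about 0.33, as reported above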
Overfitting#
I think the most important concept in Machine Learning is the concept of overfitting. The basic idea is that if we have a very flexible model (for example, a model with many parameters), it may perform very well on our data, but it may not generalize well to new data.
When the model is too flexible, it may simply be memorizing random noise within the data, rather than learning the true underlying structure.
Import a Decision Tree model from `sklearn.tree`. We will be using “distance” and “total” as our input features, and “pickup_zone” as our target. So should this be a `DecisionTreeClassifier` or a `DecisionTreeRegressor`?

Our inputs are numeric and our outputs are discrete classes. That means this is a classification task: for the choice between classification and regression, all that matters is the outputs, not the inputs. So we use a `DecisionTreeClassifier`.
We will discuss what a `DecisionTreeClassifier` actually does next week. For now, just know that it is another model for classification, like logistic regression.
from sklearn.tree import DecisionTreeClassifier
Usually we will pass at least one keyword argument to the following constructor (putting some constraint on `clf`), but here we just use the default values. Because we are not placing any constraints on the complexity of the decision tree, we are very much at risk of overfitting. (A sketch of what a constrained constructor could look like follows the next cell.)
clf = DecisionTreeClassifier()
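For example (illustrative values I'm choosing here, not settings we used in class), keyword arguments like `max_depth` or `max_leaf_nodes` cap how complex the fitted tree can get:

# a constrained tree: at most 3 levels of splits (illustrative value)
clf_small = DecisionTreeClassifier(max_depth=3)

# alternatively, cap the total number of leaves
clf_small = DecisionTreeClassifier(max_leaf_nodes=10)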
Fit the model to the data.
Here I name the predictor columns so I can save some typing later.
cols = ["distance", "total"]
Here we fit the model to the data. Even though we don’t know what a decision tree is, we can do this fitting easily, because it is the same syntax as for other models in scikit-learn.
clf.fit(df[cols], df["pickup_zone"])
DecisionTreeClassifier()
What is the model’s accuracy? Use the `score` method.
Wow, we’ve gotten 93% accuracy. Is that good? No, it is almost certainly bad! Random guessing would give us about 25% accuracy (since there are four roughly equally frequent classes), and this is so much higher. Do you really think there is any model that takes the fare and the distance as input and returns the pickup zone (from among these four options) with 93% accuracy? Would you even expect a human expert to be able to do that?
clf.score(df[cols], df["pickup_zone"])
0.9324796274738067
Detecting overfitting using a test set#
Import the `train_test_split` function from `sklearn.model_selection`.
from sklearn.model_selection import train_test_split
Divide the data into a training set and a test set. Use 20% of the rows for the test set.
It’s worth memorizing the syntax on the left-hand side of the following expression, because we will use it often. The input (`df[cols]`) gets divided into two components, a training set and a test set, named `X_train` and `X_test`. Similarly, the output (`df["pickup_zone"]`) gets divided into `y_train` and `y_test`.
We specify that the test set should be 20% of the data by using `test_size=0.2`. If we had instead passed an integer, for example `test_size=80`, then the test set would have exactly 80 rows.
X_train, X_test, y_train, y_test = train_test_split(df[cols], df["pickup_zone"], test_size=0.2)
Here is what `X_test` looks like. It corresponds to a random 20% of the input data.
X_test
|      | distance | total |
|------|----------|-------|
| 2234 | 1.70 | 24.35 |
| 4069 | 0.45 | 9.96 |
| 556 | 4.70 | 26.15 |
| 4621 | 0.80 | 12.35 |
| 2763 | 15.83 | 68.47 |
| ... | ... | ... |
| 2417 | 3.20 | 15.80 |
| 2522 | 1.44 | 14.16 |
| 2695 | 1.34 | 14.14 |
| 4811 | 1.49 | 17.16 |
| 5159 | 1.90 | 18.36 |
172 rows Ă— 2 columns
Here are the true answers corresponding to `X_test`. Notice how the index values are the same for both, starting with 2234, then 4069, etc. (If we rerun the code, different values will show up, because we didn’t do anything to make the `train_test_split` random selections reproducible; see the sketch after this output.)
y_test
2234 Penn Station/Madison Sq West
4069 Upper East Side South
556 Clinton East
4621 Upper East Side South
2763 Clinton East
...
2417 Clinton East
2522 Upper East Side South
2695 Upper East Side South
4811 Penn Station/Madison Sq West
5159 Midtown Center
Name: pickup_zone, Length: 172, dtype: object
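As an aside (not something we did in class), the split can be made reproducible by passing the `random_state` keyword argument; the particular seed value is arbitrary:

# the same rows land in the test set on every run
X_train, X_test, y_train, y_test = train_test_split(
    df[cols], df["pickup_zone"], test_size=0.2, random_state=0
)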
Fit a decision tree model using the training set.
clf2 = DecisionTreeClassifier()
A major conceptual mistake (not just a small typo) would be to fit this on the test set. The whole point is to keep the test set secret during the fitting process.
clf2.fit(X_train, y_train)
DecisionTreeClassifier()
What is the accuracy of the model on the training set?
Here we’ve achieved slightly better performance than above.
# 94% accuracy (your numbers will be different)
clf2.score(X_train, y_train)
0.9417758369723436
What is the accuracy of the model on the test set?
The following result is the most important part of this notebook. Notice how we have crashed from 94% accuracy on the training set to test accuracy only slightly better than random guessing. This is a very strong sign that our model has been overfitting the data. It has learned the training data very well, but there is no evidence that it will perform well on new, unseen data.
clf2.score(X_test, y_test)
0.28488372093023256
How does this result suggest overfitting?
The much better score on the training set than on the test set is a strong sign of overfitting. (One possible remedy is sketched below.)
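To connect this back to the earlier comment about constraining the constructor (a sketch with illustrative values; your exact numbers will differ from run to run), a less flexible tree should score lower on the training set but show a much smaller gap between training and test accuracy:

# a less flexible tree is less able to memorize noise in the training data
clf3 = DecisionTreeClassifier(max_leaf_nodes=10)
clf3.fit(X_train, y_train)
print(clf3.score(X_train, y_train))  # expect noticeably lower than 94%
print(clf3.score(X_test, y_test))    # expect much closer to the training score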