Midterm Review¶
Nothing scheduled today; ask questions!
Announcements¶
I have notecards at the front.
Midterm tomorrow during discussion section!
Yasmeen has posted both handwritten solutions and Deepnote solutions to the Sample Midterm. These solutions are at the top of the Week 8 page on Canvas.
The video quizzes should all be re-opened (taking the video quizzes now will not affect your grade). Feel free to attempt them as many times as you want. The links are on the individual Canvas Week pages, under the corresponding videos.
Friday I will introduce the course project. There will be a short homework due Tuesday Week 9 (posted soon!) asking you to find a possible dataset and topic for the course project.
Week 9 will introduce new Machine Learning material: decision trees and random forests.
The last in-class quiz of the quarter will be Tuesday, Week 10.
Midterm topics¶
3 questions (multipart).
Most of the questions relate directly to Machine Learning.
Machine Learning in scikit-learn:
The usual procedure: import, create/instantiate, fit, predict/transform (a short sketch appears at the end of this list)
StandardScaler
KMeans
LinearRegression, polynomial regression
LogisticRegression
train_test_split
mean_squared_error and mean_absolute_error
fit, predict, transform
coef_ and intercept_
predict_proba and score
Common errors (when does the data need to be two-dimensional?)
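For reference, here is a minimal sketch of that usual procedure (the tiny DataFrame below is made up for illustration; it is not one of the course datasets):

import pandas as pd
from sklearn.linear_model import LinearRegression

# A made-up DataFrame with two input features and one target column.
df_demo = pd.DataFrame({"x1": [0, 1, 2, 3, 4], "x2": [5, 3, 8, 1, 2], "y": [1, 3, 5, 7, 9]})

reg = LinearRegression()                        # create/instantiate
reg.fit(df_demo[["x1", "x2"]], df_demo["y"])    # fit (the input needs to be two-dimensional)
reg.predict(df_demo[["x1", "x2"]])              # predict
reg.coef_, reg.intercept_                       # the fitted coefficients and intercept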
Machine Learning theory:
Unsupervised learning vs Supervised learning. Regression vs classification.
What is the K-Means clustering algorithm (i.e., what are its steps)?
When is rescaling data important?
Overfitting, underfitting (relationship to flexibility/polynomial regression, training set/test set, test error curve)
Interpretation of coefficients in linear regression (interpretation 1: which features are the most important? interpretation 2: what happens when we increase this value?)
Interpretation of coefficients in logistic regression in the binary case (how do these relate to probabilities?)
The definition of the sigmoid function; its relationship to logistic regression; why is it a natural choice of function for modeling a probability? (See the sketch at the end of this list.)
What is a decision boundary in logistic regression? Why is logistic regression in the sklearn.linear_model module?
Why shouldn’t we use linear regression for a classification problem?
In what sense are we finding the “best” coefficients in linear regression?
Which of the following is not like the others? accuracy, cost, error rate, loss
Which of the following is not like the others? feature, input, predictor, target
How do outliers impact the values of mean_absolute_error and mean_squared_error?
The MNIST dataset. (What are the dimensions? How do the values relate to images? Why is it considered a classification problem?)
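For the sigmoid item, here is a quick sketch (not from lecture) of the definition sigma(x) = 1/(1 + e^(-x)) and of the property that makes it natural for probabilities:

import numpy as np

def sigmoid(x):
    # The sigmoid function used in logistic regression.
    return 1 / (1 + np.exp(-x))

# Every output lies strictly between 0 and 1, with sigmoid(0) equal to 0.5,
# which is why it is a natural choice for modeling a probability.
sigmoid(np.array([-100, -1, 0, 1, 100]))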
One or two parts might be only indirectly related to Machine Learning. Most of those possible topics involve pandas:
Two ways to get datetime values in pandas: using the dt accessor or using map (a sketch appears at the end of this list).
apply
groupby
merging two DataFrames
Feature Engineering, for example, adding columns representing Day of the Week with the Seattle bike data. (This is very much a Machine Learning concept, but the difficulty with it is implementing it using pandas.)
Selections in Altair (code like fields=["Close"]), using a selection with a condition
Displaying multiple Altair charts using vconcat
Using copy in pandas and in Altair
Reshaping a NumPy array using reshape, including reshape(-1).
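Here is a sketch of the datetime item above (the variable and column names are made up; this is not the Seattle bike data):

import pandas as pd

# A made-up DataFrame with a single datetime column.
df_dates = pd.DataFrame({"Date": pd.to_datetime(["2022-05-16", "2022-05-17", "2022-05-18"])})

df_dates["Date"].dt.day_name()                 # way 1: the dt accessor
df_dates["Date"].map(lambda d: d.day_name())   # way 2: map with a lambda function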
Topics that won’t be covered:
make_regression, making random numbers in NumPy
PolynomialFeatures
Nothing directly related to Matplotlib (but you should understand how a handwritten digit can be represented using a NumPy array).
Class lecture notebook¶
Most of the lecture was done at the whiteboard, so the following is just some miscellaneous parts that were done in Deepnote.
import pandas as pd
df = pd.DataFrame({0:range(5), 3:range(0,10,2), 1:range(0,15,3)})
df
|   | 0 | 3 | 1 |
|---|---|---|---|
| 0 | 0 | 0 | 0 |
| 1 | 1 | 2 | 3 |
| 2 | 2 | 4 | 6 |
| 3 | 3 | 6 | 9 |
| 4 | 4 | 8 | 12 |
df[1]
0 0
1 3
2 6
3 9
4 12
Name: 1, dtype: int64
df.iloc[1]
0 1
3 2
1 3
Name: 1, dtype: int64
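Side by side: df[1] selects the column labeled 1, while df.iloc[1] selects the row at integer position 1. As a quick sketch (not from lecture), df.loc[1] selects by row label, which here happens to give the same row as df.iloc[1]:

# loc selects by row label; the labels here are 0 through 4, so df.loc[1] and df.iloc[1] agree.
df.loc[1]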
There is no column named 2, so the following does not work.
df[2]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
3360 try:
-> 3361 return self._engine.get_loc(casted_key)
3362 except KeyError as err:
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()
KeyError: 2
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_3398/2772902488.py in <module>
----> 1 df[2]
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/frame.py in __getitem__(self, key)
3456 if self.columns.nlevels > 1:
3457 return self._getitem_multilevel(key)
-> 3458 indexer = self.columns.get_loc(key)
3459 if is_integer(indexer):
3460 indexer = [indexer]
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
3361 return self._engine.get_loc(casted_key)
3362 except KeyError as err:
-> 3363 raise KeyError(key) from err
3364
3365 if is_scalar(key) and isna(key) and not self.hasnans:
KeyError: 2
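A short sketch (these lines were not in the lecture) of how to check for a column label before indexing, to avoid this KeyError:

# The column labels of df are 0, 3, 1, so 2 is not among them.
2 in df.columns   # False
1 in df.columns   # True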
df2 = pd.concat((df,df))
df2
|   | 0 | 3 | 1 |
|---|---|---|---|
| 0 | 0 | 0 | 0 |
| 1 | 1 | 2 | 3 |
| 2 | 2 | 4 | 6 |
| 3 | 3 | 6 | 9 |
| 4 | 4 | 8 | 12 |
| 0 | 0 | 0 | 0 |
| 1 | 1 | 2 | 3 |
| 2 | 2 | 4 | 6 |
| 3 | 3 | 6 | 9 |
| 4 | 4 | 8 | 12 |
df2.loc[1]
|   | 0 | 3 | 1 |
|---|---|---|---|
| 1 | 1 | 2 | 3 |
| 1 | 1 | 2 | 3 |
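Because pd.concat kept the original row labels, the label 1 now occurs twice, which is why df2.loc[1] returned two rows. As a sketch (not from lecture), passing ignore_index=True to pd.concat produces a fresh 0, 1, 2, ... index, so each label occurs only once:

# ignore_index=True discards the old row labels and numbers the rows 0 through 9.
df3 = pd.concat((df, df), ignore_index=True)
df3.loc[1]   # now a single row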
import seaborn as sns
df = sns.load_dataset("taxis").dropna()
X = df[["passengers", "distance"]]
df.color.unique()
array(['yellow', 'green'], dtype=object)
df.color.value_counts()
yellow 5373
green 968
Name: color, dtype: int64
df.color.value_counts().index
Index(['yellow', 'green'], dtype='object')
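A quick aside (my own check, not part of the lecture): value_counts can also report proportions, which gives a baseline to compare with the classifier’s accuracy below.

# normalize=True reports proportions instead of counts; always predicting "yellow"
# would already be correct roughly 85% of the time.
df.color.value_counts(normalize=True)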
y = df["color"]
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
clf.fit(X,y)
LogisticRegression()
clf.score(X,y)
0.8473426904273774
(clf.predict(X)==y).mean()
0.8473426904273774
(clf.predict(X)==y).sum()/len(y)
0.8473426904273774
Above we fit on the entire dataset; that was just to save time. In practice, we should fit on a training subset and evaluate on a separate test subset.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,train_size=0.4)
X.shape
(6341, 2)
X_train will contain 40% of the rows of X.
X_train.shape
(2536, 2)
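A side note (a sketch with hypothetical variable names, not something done in lecture): passing random_state to train_test_split makes the split reproducible, which is convenient when re-running cells.

# random_state fixes the randomness, so the same rows land in the training set each run.
Xtr, Xts, ytr, yts = train_test_split(X, y, train_size=0.4, random_state=0)
Xtr.shape   # also (2536, 2); only which rows get chosen becomes deterministic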
clf.fit(X_train, y_train)
LogisticRegression()
clf.score(X_train, y_train)
0.8568611987381703
clf.score(X_test, y_test)
0.8409986859395532
The test score is similar to, or even slightly higher than, the training score, so there is no evidence of overfitting. (The exact values may change when the cells are executed again, since the split is random.)
Remember that score and error point in opposite directions: a higher score means better performance, while a lower error means better performance. So the analogous statement in terms of error is: “the test error is lower than the training error, so there is no evidence of overfitting.”
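To make that concrete, here is a sketch (not from lecture) converting the scores above into error rates:

# For classification, error rate = 1 - accuracy (the value returned by score).
train_error = 1 - clf.score(X_train, y_train)
test_error = 1 - clf.score(X_test, y_test)
# The test error not being much larger than the training error is the same
# "no evidence of overfitting" conclusion, stated in terms of error.
train_error, test_error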