Midterm Review

Nothing scheduled today; ask questions!

Announcements

  • I have notecards at the front.

  • Midterm tomorrow during discussion section!

  • Yasmeen has posted both handwritten solutions and Deepnote solutions to the Sample Midterm. These solutions are at the top of the Week 8 page on Canvas.

  • The video quizzes have all been re-opened (taking them now will not affect your grade). Feel free to attempt them as many times as you want. The links are on the individual Canvas Week pages, under the corresponding videos.

  • Friday I will introduce the course project. There will be a short homework due Tuesday of Week 9 (posted soon!) asking you to find a possible dataset and topic for the course project.

  • Week 9 will introduce new Machine Learning material: decision trees and random forests.

  • The last in-class quiz of the quarter will be Tuesday, Week 10.

Midterm topics

Three questions (each with multiple parts).

Most of the questions relate directly to Machine Learning.

Machine Learning in scikit-learn:

  • The usual procedure: import, create/instantiate, fit, predict/transform (see the sketch after this list)

  • StandardScaler

  • KMeans

  • LinearRegression, polynomial regression

  • LogisticRegression

  • train_test_split

  • mean_squared_error and mean_absolute_error

  • fit, predict, transform

  • coef_ and intercept_

  • predict_proba and score

  • Common errors (when does the data need to be two-dimensional?)
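
As a reminder of the usual scikit-learn workflow, here is a minimal sketch using StandardScaler and LinearRegression on made-up numbers (the data values below are placeholders, not from any dataset in the course):

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# X must be two-dimensional (rows = samples, columns = features),
# even when there is only one feature; y can be one-dimensional.
X = np.array([1.0, 2.0, 3.0, 4.0]).reshape(-1, 1)
y = np.array([2.1, 3.9, 6.2, 8.1])

# The usual procedure: import, instantiate, fit, then transform or predict.
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)

reg = LinearRegression()
reg.fit(X_scaled, y)
preds = reg.predict(X_scaled)
reg.coef_, reg.intercept_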

Machine Learning theory:

  • Unsupervised learning vs Supervised learning. Regression vs classification.

  • What is the K-Means clustering algorithm (i.e., what are its steps)?

  • When is rescaling data important?

  • Overfitting, underfitting (relationship to flexibility/polynomial regression, training set/test set, test error curve)

  • Interpretation of coefficients in linear regression (interpretation 1: which features are the most important? interpretation 2: what happens when we increase this value?)

  • Interpretation of coefficients in logistic regression in the binary case (how do these relate to probabilities?)

  • The definition of the sigmoid function; its relationship to logistic regression; why is it a natural choice for modeling a probability? (See the sketch after this list.)

  • What is a decision boundary in logistic regression? Why is logistic regression in the sklearn.linear_model module?

  • Why shouldn’t we use linear regression for a classification problem?

  • In what sense are we finding the “best” coefficients in linear regression?

  • Which of the following is not like the others? accuracy, cost, error rate, loss

  • Which of the following is not like the others? feature, input, predictor, target

  • How do outliers impact the values of mean_absolute_error and mean_squared_error?

  • The MNIST dataset. (What are the dimensions? How do the values relate to images? Why is it considered a classification problem?)
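
Here is a minimal sketch of the sigmoid function; the predict_proba comparison in the comments assumes a fitted binary classifier clf, like the taxi example later on this page:

import numpy as np

def sigmoid(z):
    # The output is always strictly between 0 and 1, which is why the
    # sigmoid is a natural choice for modeling a probability.
    return 1 / (1 + np.exp(-z))

# For a fitted binary LogisticRegression clf with feature matrix X,
# sigmoid(X @ clf.coef_[0] + clf.intercept_[0]) matches
# clf.predict_proba(X)[:, 1], the predicted probability of clf.classes_[1].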

One or two parts might be only indirectly related to Machine Learning. Most of those possible topics involve pandas:

  • Two ways to get datetime values in pandas: using the dt accessor or using map (see the sketch after this list).

  • apply

  • groupby

  • merging two DataFrames

  • Feature Engineering, for example, adding columns representing Day of the Week with the Seattle bike data. (This is very much a Machine Learning concept, but the difficulty with it is implementing it using pandas.)

  • Selections in Altair (code like fields=["Close"]), using a selection with a condition

  • Displaying multiple Altair charts using vconcat

  • Using copy in pandas and in Altair

  • Reshaping a NumPy array using reshape, including reshape(-1).
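
Here is a minimal sketch of the datetime and feature engineering material on a small made-up DataFrame (the column names are placeholders, not the exact Seattle bike data), together with a reshape reminder:

import numpy as np
import pandas as pd

df = pd.DataFrame({"Date": ["2022-05-16", "2022-05-17", "2022-05-18"]})
df["Date"] = pd.to_datetime(df["Date"])

# Way 1: the dt accessor.
df["Day of Week"] = df["Date"].dt.day_name()

# Way 2: map, which applies a function to each entry of the Series.
df["Day of Week 2"] = df["Date"].map(lambda d: d.day_name())

# reshape(-1) flattens an array to one dimension;
# reshape(-1, 1) produces a single two-dimensional column.
arr = np.arange(6).reshape(2, 3)
arr.reshape(-1)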

Topics that won’t be covered:

  • make_regression, making random numbers in NumPy

  • PolynomialFeatures

  • Anything directly related to Matplotlib (though you should understand how a handwritten digit can be represented using a NumPy array).

Class lecture notebook

Most of the lecture happened at the whiteboard, so the following records just a few miscellaneous examples that were done in Deepnote.

import pandas as pd
df = pd.DataFrame({0:range(5), 3:range(0,10,2), 1:range(0,15,3)})
df
   0  3   1
0  0  0   0
1  1  2   3
2  2  4   6
3  3  6   9
4  4  8  12
df[1]
0     0
1     3
2     6
3     9
4    12
Name: 1, dtype: int64
df.iloc[1]
0    1
3    2
1    3
Name: 1, dtype: int64

Note the contrast above: df[1] selects the column labeled 1, while df.iloc[1] selects the row at integer position 1. There is no column labeled 2, so the following raises a KeyError.

df[2]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3360             try:
-> 3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

KeyError: 2

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_3398/2772902488.py in <module>
----> 1 df[2]

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/frame.py in __getitem__(self, key)
   3456             if self.columns.nlevels > 1:
   3457                 return self._getitem_multilevel(key)
-> 3458             indexer = self.columns.get_loc(key)
   3459             if is_integer(indexer):
   3460                 indexer = [indexer]

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:
-> 3363                 raise KeyError(key) from err
   3364 
   3365         if is_scalar(key) and isna(key) and not self.hasnans:

KeyError: 2

Concatenating df with itself keeps the original index labels, so each label now occurs twice; notice what df2.loc[1] returns below.

df2 = pd.concat((df,df))
df2
   0  3   1
0  0  0   0
1  1  2   3
2  2  4   6
3  3  6   9
4  4  8  12
0  0  0   0
1  1  2   3
2  2  4   6
3  3  6   9
4  4  8  12
df2.loc[1]
   0  3  1
1  1  2  3
1  1  2  3

Next, a logistic regression example: we predict the color of a taxi ride ("yellow" or "green") from the "passengers" and "distance" columns of the taxis dataset from Seaborn.

import seaborn as sns
df = sns.load_dataset("taxis").dropna()
X = df[["passengers", "distance"]]
df.color.unique()
array(['yellow', 'green'], dtype=object)
df.color.value_counts()
yellow    5373
green      968
Name: color, dtype: int64
df.color.value_counts().index
Index(['yellow', 'green'], dtype='object')
y = df["color"]
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
clf.fit(X,y)
LogisticRegression()
clf.score(X,y)
0.8473426904273774
(clf.predict(X)==y).mean()
0.8473426904273774
(clf.predict(X)==y).sum()/len(y)
0.8473426904273774

Note that the three computations above agree: clf.score returns the proportion of correct predictions, i.e., the accuracy. Also, above we fit on the entire dataset; that was just to save time. In practice we should always fit on a training subset and evaluate on a separate test subset.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,train_size=0.4)
X.shape
(6341, 2)

Since we passed train_size=0.4, X_train contains 40% of the rows of X.

X_train.shape
(2536, 2)
clf.fit(X_train, y_train)
LogisticRegression()
clf.score(X_train, y_train)
0.8568611987381703
clf.score(X_test, y_test)
0.8409986859395532

The test score is similar to the training score (here slightly lower), so there is no strong evidence of overfitting. (The exact values may change when the cell is executed again, since train_test_split chooses a random split.)

Remember that score and error point in opposite directions: a higher score means better performance, while a lower error means better performance. So the following statement is logically correct: “the test error is lower than the training error, so there is no evidence of overfitting.”
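
As a quick check (a minimal sketch reusing the fitted classifier from above): for a classifier, score returns the accuracy, so the error rate is its complement.

# Accuracy and error rate are complements, so these two numbers sum to 1.
test_accuracy = clf.score(X_test, y_test)
test_error = 1 - test_accuracy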