Midterm Review¶
Nothing scheduled today; ask questions!
Announcements¶
I have notecards at the front.
Midterm tomorrow during discussion section!
Yasmeen has posted both handwritten solutions and Deepnote solutions to the Sample Midterm. These solutions are at the top of the Week 8 page on Canvas.
The video quizzes should all be re-opened (taking the video quizzes now will not affect your grade). Feel free to attempt them as many times as you want. The links are on the individual Canvas Week pages, under the corresponding videos.
Friday I will introduce the course project. There will be a short homework due Tuesday Week 9 (posted soon!) asking you to find a possible dataset and topic for the course project.
Week 9 will introduce new Machine Learning material: decision trees and random forests.
The last in-class quiz of the quarter will be Tuesday, Week 10.
Midterm topics¶
3 questions (multipart).
Most of the questions relate directly to Machine Learning.
Machine Learning in scikit-learn:
The usual procedure: import, create/instantiate, fit, predict/transform (a short sketch appears at the end of this list)
StandardScaler
KMeans
LinearRegression, polynomial regression
LogisticRegression
train_test_split
mean_squared_error and mean_absolute_error
fit, predict, transform
coef_ and intercept_
predict_proba and score
Common errors (when does the data need to be two-dimensional?)
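For reference, here is a minimal sketch of that usual procedure (the tiny DataFrame below is made up for illustration; it is not one of the course datasets):

import pandas as pd
from sklearn.linear_model import LinearRegression

# A made-up DataFrame with two input features and one target column.
df_demo = pd.DataFrame({"x1": [0, 1, 2, 3, 4], "x2": [5, 3, 8, 1, 2], "y": [1, 3, 5, 7, 9]})

reg = LinearRegression()                        # create/instantiate
reg.fit(df_demo[["x1", "x2"]], df_demo["y"])    # fit (the input needs to be two-dimensional)
reg.predict(df_demo[["x1", "x2"]])              # predict
reg.coef_, reg.intercept_                       # the fitted coefficients and intercept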
Machine Learning theory:
Unsupervised learning vs Supervised learning. Regression vs classification.
What is the K-Means clustering algorithm (i.e., what are its steps)?
When is rescaling data important?
Overfitting, underfitting (relationship to flexibility/polynomial regression, training set/test set, test error curve)
Interpretation of coefficients in linear regression (interpretation 1: which features are the most important? interpretation 2: what happens when we increase this value?)
Interpretation of coefficients in logistic regression in the binary case (how do these relate to probabilities?)
The definition of the sigmoid function; its relationship to logistic regression; why is it a natural choice of function for modeling a probability? (See the sketch at the end of this list.)
What is a decision boundary in logistic regression? Why is logistic regression in the sklearn.linear_model module?
Why shouldn’t we use linear regression for a classification problem?
In what sense are we finding the “best” coefficients in linear regression?
Which of the following is not like the others? accuracy, cost, error rate, loss
Which of the following is not like the others? feature, input, predictor, target
How do outliers impact the values of mean_absolute_error and mean_squared_error?
The MNIST dataset. (What are the dimensions? How do the values relate to images? Why is it considered a classification problem?)
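For the sigmoid item, here is a quick sketch (not from lecture) of the definition sigma(x) = 1/(1 + e^(-x)) and of the property that makes it natural for probabilities:

import numpy as np

def sigmoid(x):
    # The sigmoid function used in logistic regression.
    return 1 / (1 + np.exp(-x))

# Every output lies strictly between 0 and 1, with sigmoid(0) equal to 0.5,
# which is why it is a natural choice for modeling a probability.
sigmoid(np.array([-100, -1, 0, 1, 100]))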
One or two parts might be only indirectly related to Machine Learning. Most of those possible topics involve pandas:
Two ways to get datetime values in pandas: using the dt accessor or using map (a sketch appears at the end of this list).
apply
groupby
merging two DataFrames
Feature Engineering, for example, adding columns representing Day of the Week with the Seattle bike data. (This is very much a Machine Learning concept, but the difficulty with it is implementing it using pandas.)
Selections in Altair (code like fields=["Close"]), using a selection with a condition
Displaying multiple Altair charts using vconcat
Using copy in pandas and in Altair
Reshaping a NumPy array using reshape, including reshape(-1).
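Here is a sketch of the datetime item above (the variable and column names are made up; this is not the Seattle bike data):

import pandas as pd

# A made-up DataFrame with a single datetime column.
df_dates = pd.DataFrame({"Date": pd.to_datetime(["2022-05-16", "2022-05-17", "2022-05-18"])})

df_dates["Date"].dt.day_name()                 # way 1: the dt accessor
df_dates["Date"].map(lambda d: d.day_name())   # way 2: map with a lambda function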
Topics that won’t be covered:
make_regression, making random numbers in NumPy
PolynomialFeatures
Nothing directly related to Matplotlib (but you should understand how a handwritten digit can be represented using a NumPy array).
Class lecture notebook¶
Most of the lecture was done at the whiteboard, so the following is just some miscellaneous parts that were done in Deepnote.
import pandas as pd
df = pd.DataFrame({0:range(5), 3:range(0,10,2), 1:range(0,15,3)})
df
|   | 0 | 3 | 1 |
|---|---|---|---|
| 0 | 0 | 0 | 0 |
| 1 | 1 | 2 | 3 |
| 2 | 2 | 4 | 6 |
| 3 | 3 | 6 | 9 |
| 4 | 4 | 8 | 12 |
df[1]
0 0
1 3
2 6
3 9
4 12
Name: 1, dtype: int64
df.iloc[1]
0 1
3 2
1 3
Name: 1, dtype: int64
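Side by side: df[1] selects the column labeled 1, while df.iloc[1] selects the row at integer position 1. As a quick sketch (not from lecture), df.loc[1] selects by row label, which here happens to give the same row as df.iloc[1]:

# loc selects by row label; the labels here are 0 through 4, so df.loc[1] and df.iloc[1] agree.
df.loc[1]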
There is no column named 2, so the following does not work.
df[2]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
3360 try:
-> 3361 return self._engine.get_loc(casted_key)
3362 except KeyError as err:
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()
KeyError: 2
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_3398/2772902488.py in <module>
----> 1 df[2]
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/frame.py in __getitem__(self, key)
3456 if self.columns.nlevels > 1:
3457 return self._getitem_multilevel(key)
-> 3458 indexer = self.columns.get_loc(key)
3459 if is_integer(indexer):
3460 indexer = [indexer]
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
3361 return self._engine.get_loc(casted_key)
3362 except KeyError as err:
-> 3363 raise KeyError(key) from err
3364
3365 if is_scalar(key) and isna(key) and not self.hasnans:
KeyError: 2
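A short sketch (these lines were not in the lecture) of how to check for a column label before indexing, to avoid this KeyError:

# The column labels of df are 0, 3, 1, so 2 is not among them.
2 in df.columns   # False
1 in df.columns   # True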
df2 = pd.concat((df,df))
df2
|   | 0 | 3 | 1 |
|---|---|---|---|
| 0 | 0 | 0 | 0 |
| 1 | 1 | 2 | 3 |
| 2 | 2 | 4 | 6 |
| 3 | 3 | 6 | 9 |
| 4 | 4 | 8 | 12 |
| 0 | 0 | 0 | 0 |
| 1 | 1 | 2 | 3 |
| 2 | 2 | 4 | 6 |
| 3 | 3 | 6 | 9 |
| 4 | 4 | 8 | 12 |
df2.loc[1]
|   | 0 | 3 | 1 |
|---|---|---|---|
| 1 | 1 | 2 | 3 |
| 1 | 1 | 2 | 3 |
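Because pd.concat kept the original row labels, the label 1 now occurs twice, which is why df2.loc[1] returned two rows. As a sketch (not from lecture), passing ignore_index=True to pd.concat produces a fresh 0, 1, 2, ... index, so each label occurs only once:

# ignore_index=True discards the old row labels and numbers the rows 0 through 9.
df3 = pd.concat((df, df), ignore_index=True)
df3.loc[1]   # now a single row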
import seaborn as sns
df = sns.load_dataset("taxis").dropna()
X = df[["passengers", "distance"]]
df.color.unique()
array(['yellow', 'green'], dtype=object)
df.color.value_counts()
yellow 5373
green 968
Name: color, dtype: int64
df.color.value_counts().index
Index(['yellow', 'green'], dtype='object')
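A quick aside (my own check, not part of the lecture): value_counts can also report proportions, which gives a baseline to compare with the classifier’s accuracy below.

# normalize=True reports proportions instead of counts; always predicting "yellow"
# would already be correct roughly 85% of the time.
df.color.value_counts(normalize=True)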
y = df["color"]
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
clf.fit(X,y)
LogisticRegression()
clf.score(X,y)
0.8473426904273774
(clf.predict(X)==y).mean()
0.8473426904273774
(clf.predict(X)==y).sum()/len(y)
0.8473426904273774
Above we fit on the entire dataset; that was just to save time. In practice, we should fit on a training subset and evaluate on a separate test subset.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,train_size=0.4)
X.shape
(6341, 2)
X_train will contain 40% of the rows of X.
X_train.shape
(2536, 2)
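A side note (a sketch with hypothetical variable names, not something done in lecture): passing random_state to train_test_split makes the split reproducible, which is convenient when re-running cells.

# random_state fixes the randomness, so the same rows land in the training set each run.
Xtr, Xts, ytr, yts = train_test_split(X, y, train_size=0.4, random_state=0)
Xtr.shape   # also (2536, 2); only which rows get chosen becomes deterministic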
clf.fit(X_train, y_train)
LogisticRegression()
clf.score(X_train, y_train)
0.8568611987381703
clf.score(X_test, y_test)
0.8409986859395532
The test score is similar to, or even slightly higher than, the training score, so there is no evidence of overfitting. (The exact values may change when the cells are executed again, since the split is random.)
Remember that score and error point in opposite directions: a higher score means better performance, while a lower error means better performance. So the analogous statement in terms of error is: “the test error is lower than the training error, so there is no evidence of overfitting.”
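To make that concrete, here is a sketch (not from lecture) converting the scores above into error rates:

# For classification, error rate = 1 - accuracy (the value returned by score).
train_error = 1 - clf.score(X_train, y_train)
test_error = 1 - clf.score(X_test, y_test)
# The test error not being much larger than the training error is the same
# "no evidence of overfitting" conclusion, stated in terms of error.
train_error, test_error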