# Lab 3 - Math 178, Spring 2024
You are encouraged to work in groups of up to 3 total students, but each student should submit their own file. (It’s fine for everyone in the group to submit the same link.)
Put the full names of everyone in your group (even if you’re working alone) here. This makes grading easier.
Names:
The attached dataset `world_cup22.csv` is based on this Kaggle dataset. To make it better suited to prediction, the values in most of the columns (everything from “possession” and beyond) correspond to the team’s previous match.
Goal: Can we use statistics from a team’s previous match to predict how many goals they will score in the World Cup?
Comment: You should not expect excellent performance, because it is of course very difficult to predict how many goals a team will score. (And here we are not even considering the opponent, which is a hugely relevant piece of information.)
## Prepare the data
Read in the attached `world_cup22.csv` file and store it as a pandas DataFrame.
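For example (a minimal sketch, assuming the CSV file sits in your working directory):

```python
import pandas as pd

df = pd.read_csv("world_cup22.csv")
```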
Rescale every column from “month” and beyond (i.e., every numeric column except for the “number of goals” column) to have mean (approximately) zero and standard deviation (approximately) one.
The most straightforward way to achieve this, in my opinion, is with the following code. You should replace `???` here with code that rescales the pandas Series `col`.
df.loc[:, "month":] = df.loc[:, "month":].apply(lambda col: ???, axis=0)
Use `train_test_split` to divide the data into a training set of length 60 and a test set of length 36. Use the `random_state` keyword argument so that you get consistent results. Here `X_train` and `X_test` should be DataFrames with all of the rescaled numeric columns. `y_train` and `y_test` should be pandas Series containing only the “number of goals” column. (We will never use the “team” column in this lab; it’s just there for your own interest, and it was crucial to preparing the data, because we needed the statistics from the team’s previous match.)
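A minimal sketch of this step (the names `X` and `y` and the value `random_state=0` are my choices; any fixed `random_state` works):

```python
from sklearn.model_selection import train_test_split

X = df.loc[:, "month":]       # the rescaled numeric columns
y = df["number of goals"]     # the target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=60, test_size=36, random_state=0
)
```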
## Overfitting with linear regression
Your intuition should be that linear regression is not very prone to overfitting, because it is not a particularly flexible model (we simply choose one coefficient per predictor and an intercept). However, when there are many features relative to the number of observations, then linear regression is indeed prone to overfitting. We will see that this is the case here.
Fit a scikit-learn `LinearRegression` object to the training data.
- What is the training Mean Squared Error?
- What is the test Mean Squared Error?
- Why does this suggest overfitting?
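One way to compute these (a sketch, assuming `X_train`, `X_test`, `y_train`, and `y_test` from above):

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

reg = LinearRegression()
reg.fit(X_train, y_train)

train_mse = mean_squared_error(y_train, reg.predict(X_train))
test_mse = mean_squared_error(y_test, reg.predict(X_test))
# A test MSE much larger than the training MSE suggests overfitting.
```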
## Aside: coefficient magnitudes
Execute the following code to get a sense for which coefficients were deemed most important in our linear regression model. (The following code assumes `reg` is the name of your fit `LinearRegression` object. This code requires a recent version of Altair, which I’ve specified through the `requirements.txt` file and the Initialization notebook. This should automatically be there for you if you duplicated this project. If not, you can recreate it by clicking the Python Machine button at the lower left, selecting `Initialization notebook`, and then copying what I have in the `requirements.txt` file. Or if it’s causing trouble, just ignore all this and delete the `.sort('x')` method call in the code below.)
```python
import altair as alt
import pandas as pd

# Collect each feature name together with its fitted coefficient.
df_coef = pd.DataFrame({
    'feature': reg.feature_names_in_,
    'coef': reg.coef_
})

# Bar chart of the coefficients, with features sorted by coefficient size.
chart = alt.Chart(df_coef).mark_bar().encode(
    x='coef',
    y=alt.Y('feature').sort('x')  # delete `.sort('x')` if necessary
).properties(
    title='Coefficients'
)

chart
```
Comment: In my preparation for this lab, these coefficient sizes have been very unstable. That is another sign of overfitting. So if you do this twice with different random states, do not expect similar results.
Comment 2: These coefficient sizes would not be meaningful (in relation to each other) if we had not rescaled the input columns to have equal standard deviations.
## Preparation for cross validation
We saw above that using every column led to overfitting. Here we will restrict ourselves to only using three columns as predictors. But which three columns should we use?
Create a list of all possible length-3 tuples of column names from `X_train`. (Be sure your triples contain strings, not pandas Series. We want triples of column names, not triples of columns.)
Comment. I did this using the `combinations` function from the `itertools` module and converting the resulting generator into a list (just by wrapping it in the `list` function).
Comment 2. You would not want to do this for all possible length-10 tuples, because there would be too many. In this case, our list will have length 14,190, which is no concern.
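A sketch of this step (the name `all_triples` is my choice, not required):

```python
from itertools import combinations

# All length-3 tuples of column names; each element is a tuple of three strings.
all_triples = list(combinations(X_train.columns, 3))
len(all_triples)  # 14190
```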
Choose 1,000 of these triples randomly using `rng.choice`, where `rng` is a NumPy `default_rng` object; store the resulting NumPy array using the variable name `random_tuples`. Use a `seed` keyword argument to `default_rng` so that you get reproducible results. I recommend instantiating `rng` and calling `rng.choice` in the same cell, because this will help reproducibility.
Comment. If you wanted to have, for example, length-10 tuples, then you should just do this step directly, without ever using `itertools`.
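A sketch of the sampling step, assuming the `all_triples` list from above (the seed value 178 is arbitrary; use any fixed seed):

```python
import numpy as np

rng = np.random.default_rng(seed=178)
# A (1000, 3) NumPy array; each row holds three column names.
random_tuples = rng.choice(all_triples, size=1000)
```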
## Cross validation
Overview: For each triple of features in `random_tuples`, we will get an estimated Mean Squared Error using 10-fold cross-validation. We will choose as columns those which produce the best (i.e., lowest) MSE.
Use scikit-learn’s `cross_validate` function to generate a list as follows. Each entry in the list will be a length-2 tuple consisting of, first, the columns and, second, the MSE cross-validation score.
- Specify to use 10-fold cross-validation using the `cv` keyword argument.
- Specify `"neg_mean_squared_error"` as the `scoring` keyword argument to `cross_validate`.
- Do not use the full `X_train` in `cross_validate`. Instead only use the three columns in `triple`.
- Compute the mean of the resulting `"test_score"`. This will be the negative of the mean of the MSEs, so negate it to get a traditional (positive) MSE.
```python
mse_list = []
for triple in random_tuples:
    cv_results = cross_validate(???)
    cv_mse = ???
    mse_list.append((tuple(triple), cv_mse))
```
Comment. This code took about two minutes to run when I tried it. If necessary you can decrease the number of triples used.
Comment 2. The `tuple(triple)` is there to convert `triple` from a NumPy array into a `tuple`, so that it is hashable and is thus allowed to serve as a key to a dictionary.
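If it helps, here is one way the blanks above could be filled in (a sketch following the bulleted requirements; the helper name `cols` is my choice):

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

mse_list = []
for triple in random_tuples:
    cols = list(triple)  # the three column names for this iteration
    cv_results = cross_validate(
        LinearRegression(),
        X_train[cols],
        y_train,
        cv=10,
        scoring="neg_mean_squared_error",
    )
    # "test_score" holds negative MSEs, so negate the mean.
    cv_mse = -cv_results["test_score"].mean()
    mse_list.append((tuple(triple), cv_mse))
```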
Comment 3. Reality check: when I ran the code, these were the first two elements of `mse_list`. The triples represent the columns used and the numbers represent the corresponding cross-validation Mean Squared Errors.
```python
[(('central channel', 'inbetween offers to receive', 'red cards'),
  1.228113034518691),
 (('completed defensive line breaks', 'passes', 'crosses'),
  1.3288270772012454)]
```
Convert `mse_list` to a dictionary and then to a pandas Series. (It did not work when I tried to convert it directly to a Series.) Name the result `mse_series`.
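For example (a sketch; the triples become the index and the MSEs become the values):

```python
mse_series = pd.Series(dict(mse_list))
```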
What triple of columns (from the thousand we tested) produces the lowest cross-validation MSE? This is surprisingly easy to answer: use the `idxmin` method of a pandas Series. What is the MSE in that case?
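A sketch (the names `best_triple` and `best_mse` are my choices):

```python
best_triple = mse_series.idxmin()  # triple with the lowest cross-validation MSE
best_mse = mse_series.min()        # the corresponding MSE
```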
Fit a `LinearRegression` object to the training data using only those three columns, and compute the training MSE and the test MSE.
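One way this could look (a sketch, assuming `best_triple` from above and `mean_squared_error` imported earlier; `reg3` is my name choice):

```python
cols = list(best_triple)

reg3 = LinearRegression()
reg3.fit(X_train[cols], y_train)

train_mse = mean_squared_error(y_train, reg3.predict(X_train[cols]))
test_mse = mean_squared_error(y_test, reg3.predict(X_test[cols]))
```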
Comment. One time when I tried this, there was still considerable overfitting, but not as much as above. Even with cross-validation, it is possible to overfit, especially when performing cross validation so many times. Another time I tried this, there was no overfitting. You should view the test MSE as the most reliable indicator of performance.
## Submission
Using the `Share` button at the top right, enable public sharing, and enable Comment privileges. Then submit the created link on Canvas.
## Possible extensions
My original plan for this dataset was to illustrate the bootstrap, as in Section 5 of this paper. I changed the plan because I decided cross-validation was a more fundamental concept.
No attempt was made here to find the “best” linear regression model. You could try using cross-validation to determine the ideal number (and collection) of predictors. Be sure to evaluate the ultimate performance on an unseen test set.
You could try doing basically the exact same thing except using KNN regression instead of linear regression. (Here you will benefit from our having rescaled the predictor columns.) Try using cross-validation to select the ideal columns and the ideal value of K (the number of neighbors considered). I haven’t tried this so there is some chance something goes wrong, due to being in such a high-dimensional space (i.e., with so many columns), but I think it will be fine.