# Lab 3 - Math 178, Spring 2024
You are encouraged to work in groups of up to 3 total students, but each student should submit their own file. (It’s fine for everyone in the group to submit the same link.)
Put the full names of everyone in your group (even if you’re working alone) here. This makes grading easier.
Names:
The attached dataset `world_cup22.csv` is based on this Kaggle dataset. To make it better suited to prediction, the values in most of the columns (everything from “possession” and beyond) correspond to the team’s previous match.
Goal: Can we use statistics from a team’s previous match to predict how many goals they will score in the World Cup?
Comment: You should not expect excellent performance, because it is of course very difficult to predict how many goals a team will score. (And here we are not even considering the opponent, which is a hugely relevant piece of information.)
## Prepare the data
Read in the attached `world_cup22.csv` file and store it as a pandas DataFrame.
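For example (a minimal sketch, assuming the CSV file sits in your working directory):

```python
import pandas as pd

df = pd.read_csv("world_cup22.csv")
```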
Rescale every column from “month” and beyond (i.e., every numeric column except for the “number of goals” column) to have mean (approximately) zero and standard deviation (approximately) one.
The most straightforward way to achieve this, in my opinion, is with the following code. You should replace `???` here with code that rescales the pandas Series `col`.
df.loc[:, "month":] = df.loc[:, "month":].apply(lambda col: ???, axis=0)
Use `train_test_split` to divide the data into a training set of length 60 and a test set of length 36. Use the `random_state` keyword argument so that you get consistent results. Here `X_train` and `X_test` should be DataFrames with all of the rescaled numeric columns. `y_train` and `y_test` should be pandas Series containing only the “number of goals” column. (We will never use the “team” column in this lab; it’s just there for your own interest, and it was crucial to preparing the data, because we needed the statistics from the team’s previous match.)
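A minimal sketch of this step (the names `X` and `y` and the value `random_state=0` are my choices; any fixed `random_state` works):

```python
from sklearn.model_selection import train_test_split

X = df.loc[:, "month":]       # the rescaled numeric columns
y = df["number of goals"]     # the target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=60, test_size=36, random_state=0
)
```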
## Overfitting with linear regression
Your intuition should be that linear regression is not very prone to overfitting, because it is not a particularly flexible model (we simply choose one coefficient per predictor and an intercept). However, when there are many features relative to the number of observations, then linear regression is indeed prone to overfitting. We will see that this is the case here.
Fit a scikit-learn `LinearRegression` object to the training data.
- What is the training Mean Squared Error?
- What is the test Mean Squared Error?
- Why does this suggest overfitting?
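One way to compute these (a sketch, assuming `X_train`, `X_test`, `y_train`, and `y_test` from above):

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

reg = LinearRegression()
reg.fit(X_train, y_train)

train_mse = mean_squared_error(y_train, reg.predict(X_train))
test_mse = mean_squared_error(y_test, reg.predict(X_test))
# A test MSE much larger than the training MSE suggests overfitting.
```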
## Aside: coefficient magnitudes
Execute the following code to get a sense for which coefficients were deemed most important in our linear regression model. (The following code assumes `reg` is the name of your fit `LinearRegression` object. This code requires a recent version of Altair, which I’ve specified through the `requirements.txt` file and the Initialization notebook. This should automatically be there for you if you duplicated this project. If not, you can recreate it by clicking the Python Machine button at the lower left, selecting `Initialization notebook`, and then copying what I have in the `requirements.txt` file. Or if it’s causing trouble, just ignore all this and delete the `.sort('x')` method call in the code below.)
```python
import altair as alt
import pandas as pd

# Collect each feature name together with its fitted coefficient.
df_coef = pd.DataFrame({
    'feature': reg.feature_names_in_,
    'coef': reg.coef_
})

# Bar chart of the coefficients, with features sorted by coefficient size.
chart = alt.Chart(df_coef).mark_bar().encode(
    x='coef',
    y=alt.Y('feature').sort('x')  # delete `.sort('x')` if necessary
).properties(
    title='Coefficients'
)

chart
```
Comment: In my preparation for this lab, these coefficient sizes have been very unstable. That is another sign of overfitting. So if you do this twice with different random states, do not expect similar results.
Comment 2: These coefficient sizes would not be meaningful (in relation to each other) if we had not rescaled the input columns to have equal standard deviations.
## Preparation for cross validation
We saw above that using every column led to overfitting. Here we will restrict ourselves to only using three columns as predictors. But which three columns should we use?
Create a list of all possible length-3 tuples of column names from `X_train`. (Be sure your triples contain strings, not pandas Series. We want triples of column names, not triples of columns.)
Comment. I did this using the `combinations` function from the `itertools` module and converting the resulting generator into a list (just by wrapping it in the `list` function).
Comment 2. You would not want to do this for all possible length-10 tuples, because there would be too many. In this case, our list will have length 14,190, which is no concern.
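A sketch of this step (the name `all_triples` is my choice, not required):

```python
from itertools import combinations

# All length-3 tuples of column names; each element is a tuple of three strings.
all_triples = list(combinations(X_train.columns, 3))
len(all_triples)  # 14190
```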
Choose 1,000 of these triples randomly using `rng.choice`, where `rng` is a NumPy `default_rng` object; store the resulting NumPy array using the variable name `random_tuples`. Use a `seed` keyword argument to `default_rng` so that you get reproducible results. I recommend instantiating `rng` and calling `rng.choice` in the same cell, because this will help reproducibility.
Comment. If you wanted to have, for example, length-10 tuples, then you should just do this step directly, without ever using `itertools`.
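A sketch of the sampling step, assuming the `all_triples` list from above (the seed value 178 is arbitrary; use any fixed seed):

```python
import numpy as np

rng = np.random.default_rng(seed=178)
# A (1000, 3) NumPy array; each row holds three column names.
random_tuples = rng.choice(all_triples, size=1000)
```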
## Cross validation
Overview: For each triple of features in `random_tuples`, we will get an estimated Mean Squared Error using 10-fold cross-validation. We will choose as columns those which produce the best (i.e., lowest) MSE.
Use scikit-learn’s `cross_validate` function to generate a list as follows. Each entry in the list will be a length-2 tuple consisting of, first, the columns and, second, the MSE cross-validation score.
- Specify to use 10-fold cross-validation using the `cv` keyword argument.
- Specify `"neg_mean_squared_error"` as the `scoring` keyword argument to `cross_validate`.
- Do not use the full `X_train` in `cross_validate`. Instead only use the three columns in `triple`.
- Compute the mean of the resulting `"test_score"`. This will be the negative of the mean of the MSEs, so negate it to get a traditional (positive) MSE.
```python
mse_list = []
for triple in random_tuples:
    cv_results = cross_validate(???)
    cv_mse = ???
    mse_list.append((tuple(triple), cv_mse))
```
Comment. This code took about two minutes to run when I tried it. If necessary you can decrease the number of triples used.
Comment 2. The `tuple(triple)` is there to convert `triple` from a NumPy array into a `tuple`, so that it is hashable and is thus allowed to serve as a key to a dictionary.
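If it helps, here is one way the blanks above could be filled in (a sketch following the bulleted requirements; the helper name `cols` is my choice):

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

mse_list = []
for triple in random_tuples:
    cols = list(triple)  # the three column names for this iteration
    cv_results = cross_validate(
        LinearRegression(),
        X_train[cols],
        y_train,
        cv=10,
        scoring="neg_mean_squared_error",
    )
    # "test_score" holds negative MSEs, so negate the mean.
    cv_mse = -cv_results["test_score"].mean()
    mse_list.append((tuple(triple), cv_mse))
```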
Comment 3. Reality check: when I ran the code, these were the first two elements of `mse_list`. The triples represent the columns used and the numbers represent the corresponding cross-validation Mean Squared Errors.
```python
[(('central channel', 'inbetween offers to receive', 'red cards'),
  1.228113034518691),
 (('completed defensive line breaks', 'passes', 'crosses'),
  1.3288270772012454)]
```
Convert `mse_list` to a dictionary and then to a pandas Series. (It did not work when I tried to convert it directly to a Series.) Name the result `mse_series`.
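For example (a sketch; the triples become the index and the MSEs become the values):

```python
mse_series = pd.Series(dict(mse_list))
```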
What triple of columns (from the thousand we tested) produces the lowest cross-validation MSE? This is surprisingly easy to answer: use the `idxmin` method of a pandas Series. What is the MSE in that case?
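A sketch (the names `best_triple` and `best_mse` are my choices):

```python
best_triple = mse_series.idxmin()  # triple with the lowest cross-validation MSE
best_mse = mse_series.min()        # the corresponding MSE
```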
Fit a `LinearRegression` object to the training data using only those three columns, and compute the training MSE and the test MSE.
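One way this could look (a sketch, assuming `best_triple` from above and `mean_squared_error` imported earlier; `reg3` is my name choice):

```python
cols = list(best_triple)

reg3 = LinearRegression()
reg3.fit(X_train[cols], y_train)

train_mse = mean_squared_error(y_train, reg3.predict(X_train[cols]))
test_mse = mean_squared_error(y_test, reg3.predict(X_test[cols]))
```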
Comment. One time when I tried this, there was still considerable overfitting, but not as much as above. Even with cross-validation, it is possible to overfit, especially when performing cross validation so many times. Another time I tried this, there was no overfitting. You should view the test MSE as the most reliable indicator of performance.
## Submission
Using the `Share` button at the top right, enable public sharing, and enable Comment privileges. Then submit the created link on Canvas.
## Possible extensions
My original plan for this dataset was to illustrate the bootstrap, as in Section 5 of this paper. I changed the plan because I decided cross-validation was a more fundamental concept.
No attempt was made here to find the “best” linear regression model. You could try using cross-validation to determine the ideal number (and collection) of predictors. Be sure to evaluate the ultimate performance on an unseen test set.
You could try doing basically the exact same thing except using KNN regression instead of linear regression. (Here you will benefit from our having rescaled the predictor columns.) Try using cross-validation to select the ideal columns and the ideal value of K (the number of neighbors considered). I haven’t tried this so there is some chance something goes wrong, due to being in such a high-dimensional space (i.e., with so many columns), but I think it will be fine.