Lab 1 - Math 178, Spring 2024#

This lab is due Thursday night of Week 2. You are encouraged to work in groups of up to 3 total students, but each student should submit their own file. (It’s fine for everyone in the group to submit the same link.)

The goal of this lab is to produce a plot like what is shown in Figures 2.9, 2.10, 2.11, and 2.17 in the Introduction to Statistical Learning with Applications in Python textbook.

Put the full names of everyone in your group (even if you’re working alone) here. This makes grading easier.

Names:

Generate the data#

Our true underlying function will be \(f(x) = 3x^2\).

Create a 2000-by-2 pandas DataFrame with two columns, "x" and "y". The x-column should contain 2000 random values distributed uniformly between -5 and 5. The y-column should be defined using \(y = f(x) + \epsilon\), where \(\epsilon\) represents Gaussian random noise with mean 0. You can experiment with different standard deviations for this random noise (to set the standard deviation, use the scale keyword argument in NumPy).
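A minimal sketch of this step, assuming NumPy's `default_rng` interface; the seed and the standard deviation of 5 are arbitrary choices you should experiment with:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # seed is an arbitrary choice, for reproducibility

n = 2000
x = rng.uniform(-5, 5, size=n)               # 2000 uniform values on [-5, 5]
noise = rng.normal(loc=0, scale=5, size=n)   # Gaussian noise; scale=5 is just one choice
y = 3 * x**2 + noise                         # y = f(x) + eps, with f(x) = 3x^2

df = pd.DataFrame({"x": x, "y": y})
```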

Plot the data#

Draw a scatter-plot of this data. Chris recommends using Altair (and can best help if you use Altair), but you are welcome to use whatever you like, including Plotly, Seaborn, or Matplotlib.

A function to compute train error and test error#

Write a function get_error that takes three inputs, train_size, k, and set_used. Descriptions of these arguments:

  • train_size represents the size of the training set, as an integer. (The train_test_split function also accepts a proportion between 0 and 1, but we want to specify the absolute number of rows to use.)

  • k represents the number of neighbors to use.

  • set_used will be the string "train" or "test", and indicates whether we are computing the training error or the test error.

Within the function:

  • Divide the data into a training set and a test set using train_test_split. Be sure to choose the number of training rows using the train_size argument.

  • Instantiate a KNN object from scikit-learn. (Is this a regression problem or a classification problem?)

  • Fit the object to the training data. (Fitting to all the data or to the test data is a major mistake.)

  • Compute the train mean-squared error or the test mean-squared error, according to the set_used argument.

  • Return this MSE.
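The steps above could be sketched as follows. This assumes the DataFrame `df` from the first section is available at module level (the lab specifies only three arguments, so the data is not passed in), and uses `KNeighborsRegressor` since y is a continuous quantity:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Rebuild sample data as in the first section (scale=5 is one choice)
rng = np.random.default_rng(seed=0)
x = rng.uniform(-5, 5, size=2000)
df = pd.DataFrame({"x": x, "y": 3 * x**2 + rng.normal(scale=5, size=2000)})

def get_error(train_size, k, set_used):
    # train_size is an absolute number of rows, not a proportion
    df_train, df_test = train_test_split(df, train_size=train_size)
    reg = KNeighborsRegressor(n_neighbors=k)  # regression, not classification
    reg.fit(df_train[["x"]], df_train["y"])   # fit ONLY on the training data
    data = df_train if set_used == "train" else df_test
    return mean_squared_error(data["y"], reg.predict(data[["x"]]))
```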

Plot the results#

  • Experiment with different values of train_size and k, with the goal of making a plot similar to Figure 2.17 in the textbook. If you use Altair, you can put a log scale along the x-axis as shown here: https://altair-viz.github.io/gallery/line_with_log_scale.html (Warning: Deepnote does not have the latest version of Altair pre-installed, so you will probably need to use the attribute syntax, not the method syntax.) To show both curves together in Altair, I followed the IMDB example here, but it might be simpler to make the train curve and the test curve separately, and then layer them using +.

  • Using a log scale is not a requirement, but in my case it made the charts look better.

  • Be sure to use 1/k rather than k for the x-axis, so that more flexible fits (where overfitting is more likely) occur toward the right of the chart. That is the general convention for these charts.

  • If your chart doesn’t look at least approximately like what is shown in Figure 2.17, try changing parameters (including the standard deviation of the error from the very beginning of this lab) or check for mistakes.

Submission#

  • Using the Share button at the top right, enable public sharing, and enable Comment privileges. Then submit the created link on Canvas.

Possible extensions#

These are not required, but here are some ideas for extra practice.

  • Our chart is like the right-hand panel of Figures 2.9 to 2.11. Can you also make the left-hand panel?

  • Conceptually harder but also using the basic functionality of the get_error function: Can you make something like one of the charts shown in Figure 2.12? I don’t think this will be possible using the information in ISLP, so you will need to look up the definition of bias and variance somewhere else. (They involve averaging over many choices of equal-sized training sets. It will not be practical to use every possible choice of training set, because there are too many.) I haven’t tried this myself so I’m not sure how similar the outcome will be to what’s shown in Figure 2.12.

Created in Deepnote