Worksheet 11#

You are encouraged to work in groups of up to 3 total students, but each student should make their own submission on Canvas. (It’s fine for everyone in the group to have the same upload.)

Put the full names of everyone in your group (even if you’re working alone) here. (This makes grading easier.)

Names:

Generating linear data with an outlier#

Generate artificial data for linear regression using the make_regression function from sklearn.datasets. Name the returned input values X and the returned target values y, as usual.

Specify the following parameters.

Use 20 data points.
Use an intercept of -5.
Use 15 for the standard deviation of the random noise.
Use only one input variable.
Use a random_state keyword argument so that you have reproducible results.
Do not return the true underlying coefficient.
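As a rough sketch of how the parameters listed above could map onto keyword arguments of make_regression: the intercept corresponds to the bias argument and the noise standard deviation to the noise argument. The value random_state=0 is an arbitrary assumption, not a value the worksheet prescribes.

```python
from sklearn.datasets import make_regression

# random_state=0 is an arbitrary choice; any fixed integer gives reproducible data.
X, y = make_regression(
    n_samples=20,    # 20 data points
    n_features=1,    # a single input variable
    bias=-5,         # intercept of the underlying linear model
    noise=15,        # standard deviation of the Gaussian noise
    random_state=0,
    coef=False,      # do not return the true coefficient (this is the default)
)

# X is two-dimensional (one row per point), y is one-dimensional.
print(X.shape, y.shape)
```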

Put this data into a pandas DataFrame with columns named "x" and "y", and plot the data using an Altair scatter plot. It should look clearly linear. (If it doesn’t look clearly linear, then try a different random_state value.)

In this part we will add a row to the DataFrame corresponding to an “outlier”. Here are the details.

Add a row to the bottom of the pandas DataFrame corresponding to the point \((-1, 450)\). (Notice how this point is approximately 30 standard deviations above the “true” line.) One option for how to add this row is to execute df.loc[len(df)] = ???, where ??? gets replaced by the point you want to add. Another option is to make a pandas Series new_pt corresponding to the point you want to add, and then use df = df.append(new_pt, ignore_index=True). (The argument ignore_index enables us to append a pandas Series without first converting it to a DataFrame.) Note that DataFrame.append was removed in pandas 2.0; on newer versions, convert the Series to a one-row DataFrame and use pd.concat instead, for example df = pd.concat([df, new_pt.to_frame().T], ignore_index=True).

Warning. Don’t append the point multiple times; just append it once. You can call df.tail() to see the last 5 rows of df and make sure it doesn’t have repetitions.
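The two approaches to adding a row can be sketched on a small stand-in DataFrame. The toy values and the names base, df1, and df2 are assumptions for illustration only. Because DataFrame.append was removed in pandas 2.0, the second approach is shown with pd.concat.

```python
import pandas as pd

# Toy stand-in for the worksheet's DataFrame (hypothetical values).
base = pd.DataFrame({"x": [0.0, 1.0, 2.0], "y": [-5.0, 20.0, 45.0]})

# Approach 1: assign to the next unused integer label.
df1 = base.copy()
df1.loc[len(df1)] = [-1, 450]

# Approach 2: build a Series, turn it into a one-row DataFrame, and concatenate.
new_pt = pd.Series({"x": -1, "y": 450})
df2 = pd.concat([base, new_pt.to_frame().T], ignore_index=True)

# The outlier should appear exactly once, as the last row.
print(df1.tail())
```

Use only one of the two approaches on the real DataFrame; running both would add the point twice.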

Plot the data again using Altair. The new “outlier” point should be very visible. Name the resulting Altair Chart c1.

The line of best fit using Mean Squared Error#

Using the LinearRegression class in scikit-learn, find the line of “best” fit for the data. (We put “best” in quotation marks because we will find another “best” line later.)

Add a new column to the DataFrame named "pred_mse", which contains the result of calling the predict method of the LinearRegression class on the "x" column in df. The letters “mse” here stand for “Mean Squared Error”, which is the loss function used by the LinearRegression class.
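A minimal sketch of the fit-and-predict step, using a small stand-in DataFrame with made-up values (the real df comes from the earlier steps). The name reg is an assumption.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy stand-in for df, including an outlier as the last row (hypothetical values).
df = pd.DataFrame(
    {"x": [0.0, 1.0, 2.0, 3.0, -1.0], "y": [-5.0, 18.0, 47.0, 70.0, 450.0]}
)

reg = LinearRegression()
# The input must be two-dimensional, hence the double brackets df[["x"]].
reg.fit(df[["x"]], df["y"])

df["pred_mse"] = reg.predict(df[["x"]])
```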

Make a new Altair chart c2 containing the same x values as above but with pred_mse for the y-values. Use mark_line instead of mark_circle. Make the color of this line red, either by using color=alt.value("red") or by using mark_line(color="red").

The line of best fit using Mean Absolute Error#

Here we find another line of “best” fit, this time using Mean Absolute Error instead of Mean Squared Error.
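For reference, the standard definitions of the two loss functions are given below, where \(y_i\) are the true target values, \(\hat{y}_i\) are the predicted values, and \(n\) is the number of data points. (This notation is introduced here for illustration and is not used elsewhere in the worksheet.)

\[
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2,
\qquad
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert
\]

Because the errors are squared in MSE, a single point far from the line contributes much more to MSE than it does to MAE.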

Import the SGDRegressor class from sklearn.linear_model (the same place we import LinearRegression from).

Create an instance of this class. We have not had to change any parameters when creating a LinearRegression object, so we have just used LinearRegression() with nothing inside the parentheses. For SGDRegressor, on the other hand, use the following keyword arguments.

loss="epsilon_insensitive"
epsilon=0

You may find later that you also need to increase the max_iter value, for example, max_iter=5000 or even higher.

Don’t worry about what these specific arguments mean. Just know that the resulting object will seek to minimize Mean Absolute Error instead of Mean Squared Error.

Fit this SGDRegressor object using the "x" column from df as the input (as a one-column DataFrame, not as a pandas Series) and using the "y" column as the target.

If you get a ConvergenceWarning, then increase max_iter above. Keep increasing it until you do not get a warning. (If something else goes wrong, try changing the random_state value at the very beginning.)
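The instantiate-and-fit steps above might look like the following sketch. The data generation repeats the earlier setup so the block runs on its own. Passing random_state=0 to SGDRegressor is an extra assumption made here for reproducibility, not something the worksheet requires, and the name sgd is illustrative.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor

# Recreate the data, then add the outlier (random_state=0 is an arbitrary choice).
X, y = make_regression(n_samples=20, n_features=1, bias=-5, noise=15, random_state=0)
df = pd.DataFrame({"x": X[:, 0], "y": y})
df.loc[len(df)] = [-1, 450]

sgd = SGDRegressor(
    loss="epsilon_insensitive",
    epsilon=0,
    max_iter=5000,     # increase this if a ConvergenceWarning appears
    random_state=0,    # assumption: added only to make the fit reproducible
)
# One-column DataFrame as input, Series as target.
sgd.fit(df[["x"]], df["y"])

df["pred_mae"] = sgd.predict(df[["x"]])
```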

Add a new column to the DataFrame named "pred_mae", which contains the result of calling the predict method of the SGDRegressor class on the "x" column in df. The letters “mae” stand for “Mean Absolute Error”, which is the loss function used here.

Make a new Altair chart c3 containing the same x values as above but with pred_mae for the y-values. Use mark_line again. Make the color of this line black.

Comparing the lines#

Plot the three Altair charts layered together, by evaluating c1+c2+c3.

Which of the two lines seems to be more heavily influenced by the outlier point? Answer in a markdown cell (not in a Python comment).

Evaluating the results#

Import the functions mean_squared_error and mean_absolute_error from sklearn.metrics.

Using the mean_squared_error function, evaluate the mean squared error between df["y"] and df["pred_mse"]. Do the same thing for df["y"] and df["pred_mae"].

Compute one of these same two numbers “by hand”, i.e., without using mean_squared_error. (Hint. Make a pandas Series containing the squares of the differences, then call the mean() method.)
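The hint can be sketched on a tiny made-up example. The values below are hypothetical and chosen so the arithmetic is easy to check by eye.

```python
import pandas as pd
from sklearn.metrics import mean_squared_error

# Hypothetical true values and predictions.
df = pd.DataFrame({"y": [1.0, 2.0, 3.0, 10.0], "pred_mse": [2.0, 2.0, 4.0, 6.0]})

mse_sklearn = mean_squared_error(df["y"], df["pred_mse"])

# By hand: square the differences, then average them.
squared_diffs = (df["y"] - df["pred_mse"]) ** 2
mse_by_hand = squared_diffs.mean()

print(mse_sklearn, mse_by_hand)
```

Here the squared differences are 1, 0, 1, and 16, so both computations give 4.5.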

Which of the two predictions is better according to the mean_squared_error metric? Answer in a markdown cell.

Using the mean_absolute_error function, evaluate the mean absolute error between df["y"] and df["pred_mse"]. Do the same thing for df["y"] and df["pred_mae"].

Compute one of these same two numbers “by hand”, i.e., without using mean_absolute_error. (Hint. A pandas Series has an abs method.)
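The same idea for mean absolute error, again on hypothetical values rather than the worksheet's data.

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error

# Hypothetical true values and predictions.
df = pd.DataFrame({"y": [1.0, 2.0, 3.0, 10.0], "pred_mae": [2.0, 2.0, 4.0, 6.0]})

mae_sklearn = mean_absolute_error(df["y"], df["pred_mae"])

# By hand: take absolute values of the differences, then average them.
abs_diffs = (df["y"] - df["pred_mae"]).abs()
mae_by_hand = abs_diffs.mean()

print(mae_sklearn, mae_by_hand)
```

Here the absolute differences are 1, 0, 1, and 4, so both computations give 1.5.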

Which of the two predictions is better according to the mean_absolute_error metric? Answer in a markdown cell.

Submission#

Reminder: everyone needs to make a submission on Canvas.
Reminder: include everyone’s full name at the top, after Names.
Using the Share button at the top right, enable public sharing, and enable Comment privileges. Then submit the created link on Canvas.
