Worksheet 11
You are encouraged to work in groups of up to 3 total students, but each student should make their own submission on Canvas. (It’s fine for everyone in the group to have the same upload.)
Put the full names of everyone in your group (even if you’re working alone) here. (This makes grading easier.)
Names:
Generating linear data with an outlier
Generate artificial data for linear regression using the make_regression function from sklearn.datasets. Name the returned input values X and the returned target values y, like usual.
Specify the following parameters.
Use 20 data points.
Use an intercept of -5.
Use 15 for the standard deviation of the random noise.
Use only one input variable.
Use a random_state keyword argument so that you have reproducible results.
Do not return the true underlying coefficient.
Put this data into a pandas DataFrame with columns named "x" and "y", and plot the data using an Altair scatter plot. It should look clearly linear. (If it doesn’t look clearly linear, then try a different random_state value.)
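Example. The following is one possible sketch of the data-generation step, not the only way to do it. The value random_state=0 is an arbitrary choice made here for illustration; the keyword names (n_samples, n_features, bias, noise, coef) are the ones make_regression uses for the parameters listed above.

```python
import pandas as pd
from sklearn.datasets import make_regression

# random_state=0 is an arbitrary illustrative choice, not a required value
X, y = make_regression(
    n_samples=20,    # number of data points
    n_features=1,    # one input variable
    bias=-5,         # intercept of the underlying line
    noise=15,        # standard deviation of the random noise
    coef=False,      # do not return the true coefficient
    random_state=0,
)

# X has shape (20, 1), so take its only column for "x"
df = pd.DataFrame({"x": X[:, 0], "y": y})
```

Because coef=False, the function returns just the pair (X, y), which is why two names appear on the left-hand side.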
In this part we will add a row to the DataFrame corresponding to an “outlier”. Here are the details.
Add a row to the bottom of the pandas DataFrame corresponding to the point \((-1, 450)\). (Notice how this point is approximately 30 standard deviations above the “true” line.) One option for how to add this row is to execute df.loc[len(df)] = ???, where ??? gets replaced by the point you want to add. Another option is to make a pandas Series new_pt corresponding to the point you want to add, and then use df = df.append(new_pt, ignore_index=True). (The argument ignore_index enables us to append a pandas Series without first converting it to a DataFrame. Note that the append method was removed in pandas 2.0, so this second option only works with older versions of pandas.)
Warning. Don’t append the point multiple times; just append it once. You can call df.tail() to see the last 5 rows of df and make sure it doesn’t have repetitions.
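Example. Here is a small sketch of adding a single row to a DataFrame, shown on a made-up three-row DataFrame named toy rather than on the worksheet data. The second approach uses pd.concat, which is offered here as an assumption about what you might use in place of append on recent versions of pandas.

```python
import pandas as pd

# A made-up DataFrame, just for illustration
toy = pd.DataFrame({"x": [0.0, 1.0, 2.0], "y": [1.0, 3.0, 5.0]})

# Option 1: assign to the next unused integer label
toy.loc[len(toy)] = [-1.0, 450.0]

# Option 2 (on a fresh copy of the first three rows): build a Series,
# turn it into a one-row DataFrame, and concatenate
base = toy.iloc[:3].copy()
new_pt = pd.Series({"x": -1.0, "y": 450.0})
toy2 = pd.concat([base, new_pt.to_frame().T], ignore_index=True)

# Check the bottom of the DataFrame for accidental repetitions
last_rows = toy.tail()
```

Both toy and toy2 end up with four rows, the last of which is the added point.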
Plot the data again using Altair. The new “outlier” point should be very visible. Name the resulting Altair Chart c1.
The line of best fit using Mean Squared Error
Using the LinearRegression class in scikit-learn, find the line of “best” fit for the data. (We put “best” in quotation marks because we will find another “best” line later.)
Add a new column to the DataFrame named "pred_mse", which contains the result of calling the predict method of the LinearRegression object on the "x" column in df. The letters “mse” here stand for “Mean Squared Error”, which is the loss function used by the LinearRegression class.
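Example. A minimal sketch of the fit-then-predict pattern, shown on a made-up DataFrame named toy whose points lie exactly on the line y = 2x + 1. The names toy and reg are illustrative choices, not required names.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up data lying exactly on y = 2x + 1
toy = pd.DataFrame({"x": [0.0, 1.0, 2.0, 3.0]})
toy["y"] = 2 * toy["x"] + 1

reg = LinearRegression()
# The input is a one-column DataFrame (double brackets), the target a Series
reg.fit(toy[["x"]], toy["y"])

# Store the fitted values in a new column
toy["pred_mse"] = reg.predict(toy[["x"]])

slope = reg.coef_[0]
intercept = reg.intercept_
```

Since the toy points are perfectly linear, the fitted slope and intercept recover 2 and 1 up to floating-point error.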
Make a new Altair chart c2 containing the same x values as above but with pred_mse for the y-values. Use mark_line instead of mark_circle. Make the color of this line red, either by using color=alt.value("red") or by using mark_line(color="red").
The line of best fit using Mean Absolute Error
Here we find another line of “best” fit, this time using Mean Absolute Error instead of Mean Squared Error.
Import the SGDRegressor class from sklearn.linear_model (the same place we import LinearRegression from).
Instantiate an instance of this class. We have not had to change any parameters when creating a LinearRegression object, so we have just used LinearRegression() with nothing inside the parentheses. For SGDRegressor, on the other hand, use the following keyword arguments.
loss="epsilon_insensitive"
epsilon=0
You may find later that you also need to increase the max_iter value, for example, max_iter=5000 or even higher.
Don’t worry about what these specific arguments mean, just know that the resulting object will seek to minimize Mean Absolute Error instead of Mean Squared Error.
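Example. For the curious, here is a short sketch of why these settings correspond to absolute error. It assumes the usual definition of the epsilon-insensitive loss, max(0, |y - pred| - epsilon), which reduces to |y - pred| when epsilon is 0. The helper epsilon_insensitive is made up for this illustration and is not part of scikit-learn.

```python
import numpy as np

def epsilon_insensitive(y_true, y_pred, epsilon):
    # Hypothetical helper: per-point loss max(0, |error| - epsilon)
    return np.maximum(0, np.abs(y_true - y_pred) - epsilon)

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 0.0, 3.0])

# With epsilon=0, the loss at each point is just the absolute error
loss_eps0 = epsilon_insensitive(y_true, y_pred, epsilon=0)
abs_error = np.abs(y_true - y_pred)
```

SGDRegressor also applies a small regularization penalty by default, so in practice it minimizes Mean Absolute Error only approximately.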
Fit this SGDRegressor object using the "x" column from df as the input (as a one-column DataFrame, not as a pandas Series) and using the "y" column as the target.
If you get a ConvergenceWarning, then increase max_iter above. Keep increasing it until you do not get a warning. (If something else goes wrong, try changing the random_state value at the very beginning.)
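Example. The sketch below fits an SGDRegressor on made-up data and records whether a ConvergenceWarning was raised. The DataFrame toy, the seed values, and the variable hit_limit are illustrative assumptions; in a notebook you would normally just read the warning printed under the cell.

```python
import warnings

import numpy as np
import pandas as pd
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import SGDRegressor

# Made-up noisy linear data (seed chosen arbitrarily)
rng = np.random.default_rng(0)
toy = pd.DataFrame({"x": np.linspace(-2, 2, 30)})
toy["y"] = 3 * toy["x"] - 5 + rng.normal(scale=0.5, size=30)

sgd = SGDRegressor(
    loss="epsilon_insensitive",
    epsilon=0,
    max_iter=5000,
    random_state=0,  # arbitrary, for reproducibility
)

# Record any warnings raised during fitting
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    sgd.fit(toy[["x"]], toy["y"])  # one-column DataFrame as input

# True if the solver stopped because it ran out of iterations
hit_limit = any(issubclass(w.category, ConvergenceWarning) for w in caught)

toy["pred_mae"] = sgd.predict(toy[["x"]])
```

If hit_limit comes out True, that is the cue to raise max_iter and fit again.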
Add a new column to the DataFrame named "pred_mae", which contains the result of calling the predict method of the SGDRegressor object on the "x" column in df. The letters “mae” stand for “Mean Absolute Error”, which is the loss function used here.
Make a new Altair chart c3 containing the same x values as above but with pred_mae for the y-values. Use mark_line again. Make the color of this line black.
Comparing the lines
Plot the three Altair charts layered together, by evaluating c1+c2+c3.
Which of the two lines seems to be more heavily influenced by the outlier point? Answer in a markdown cell (not in a Python comment).
Evaluating the results
Import the functions mean_squared_error and mean_absolute_error from sklearn.metrics.
Using the mean_squared_error function, evaluate the mean squared error between df["y"] and df["pred_mse"]. Do the same thing for df["y"] and df["pred_mae"].
Compute one of these same two numbers “by hand”, i.e., without using mean_squared_error. (Hint. Make a pandas Series containing the squares of the differences, then call the mean() method.)
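Example. Here is the hint carried out on a made-up three-row DataFrame, so the arithmetic is easy to check: the squared differences are 0, 0 and 4, whose mean is 4/3. The names toy and pred are illustrative only.

```python
import pandas as pd
from sklearn.metrics import mean_squared_error

# Made-up targets and predictions
toy = pd.DataFrame({"y": [1.0, 2.0, 3.0], "pred": [1.0, 2.0, 5.0]})

# Using the scikit-learn function
mse_sklearn = mean_squared_error(toy["y"], toy["pred"])

# "By hand": a Series of squared differences, then its mean
squared_diffs = (toy["y"] - toy["pred"]) ** 2
mse_by_hand = squared_diffs.mean()
```

The two values should agree up to floating-point error.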
Which of the two predictions is better according to the mean_squared_error metric? Answer in a markdown cell.
Using the mean_absolute_error function, evaluate the mean absolute error between df["y"] and df["pred_mse"]. Do the same thing for df["y"] and df["pred_mae"].
Compute one of these same two numbers “by hand”, i.e., without using mean_absolute_error. (Hint. A pandas Series has an abs method.)
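Example. The same idea for absolute error, again on a made-up three-row DataFrame: the absolute differences are 0, 0 and 2, whose mean is 2/3. The names toy and pred are illustrative only.

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error

# Made-up targets and predictions
toy = pd.DataFrame({"y": [1.0, 2.0, 3.0], "pred": [1.0, 2.0, 5.0]})

# Using the scikit-learn function
mae_sklearn = mean_absolute_error(toy["y"], toy["pred"])

# "By hand": absolute values of the differences, then their mean
abs_diffs = (toy["y"] - toy["pred"]).abs()
mae_by_hand = abs_diffs.mean()
```

Notice that the single large error contributes 2 here, compared with 4 when the differences are squared.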
Which of the two predictions is better according to the mean_absolute_error metric? Answer in a markdown cell.
Submission
Reminder: everyone needs to make a submission on Canvas.
Reminder: include everyone’s full name at the top, after Names.
Using the Share button at the top right, enable public sharing, and enable Comment privileges. Then submit the created link on Canvas.