Worksheet 12
Contents
Worksheet 12#
Plots of polynomial regression curves#
Write a function polyreg_plot
which takes four inputs: a pandas DataFrame df_data
(which can be assumed to have two columns “x” and “y”), an integer degree
with the default value of 1
, an integer num_rows
with the default value of -1
, and an integer state
with the default value of 0
. The function should do the following.
If
num_rows
is-1
, the function should setnum_rows
to the number of rows ofdf_data
. (This assignment should go at the top of the body of the function. Use an if statement.)Select
num_rows
random rows fromdf_data
and definedf
to be a copy of these rows. Use the pandas DataFramesample
method with the keyword argumentrandom_state=state
to select these random rows. Use thecopy
method to make the copy.For each integer value of
d
from1
todegree
, includingdegree
, add a new column “xd” todf
containing the d-th power of the “x” column. For example, whend=2
, the column will be named “x2” and will be equal todf["x"]**2
. Comment. If you’d rather usePolynomialFeatures
instead of the “by-hand” method described here, that’s great. I wrote it this way in case we didn’t get toPolynomialFeatures
.Fit a scikit-learn LinearRegression object to this data, using columns “x1” to “xdegree” as the predictor columns and using “y” as the target column. For example, if
degree=3
, then we would be using the columns “x1”, “x2”, “x3”. (I believe it’s best to make all of the imports outside of the function, so likefrom sklearn.linear_model ...
would be written outside of the function, probably in an earlier cell. The instantiate step,reg = LinearRegression()
would be written inside the function.)Using NumPy’s
arange
function, make a one-dimensional NumPy array containing the values -8, -7.9, -7.8, …, 7.8, 7.9, and then reshape this array so it has the same length and one column (using.reshape(???,???)
… you should not need to type any numbers intoreshape
other than \(\pm 1\)). Name the resulting “column matrix” asX1
.Using list comprehension, make a list called
Xlist
containingX1
,X1**2
, …,X1**degree
. Concatenate these side-by-side into a NumPy arrayX
usingX = np.concatenate(Xlist, axis=???)
. Comment. Again, you can replace this step withPolynomialFeatures
if you want.Convert
X
into a pandas DataFramedfX
with column names “x1” to “xdegree” as above. This DataFrame should have 160 rows and degree columns.Use the Linear Regression object you fit above (don’t fit it again) to make predictions corresponding to the input
dfX
. Name the resulting array of predictionsypred
. This NumPy arrayypred
should have length 160.Using Matplotlib as in last week’s videos, make a Matplotlib Axes object containing a scatter-plot of the (x,y) points from
df
(useax.scatter
) and containing a traditional plot (a line chart, useax.plot
) of points (x1, y1) where x1 runs through the “x1” column ofdfX
and where y1 is the corresponding predicted value fromypred
. Set the color of the line plot to red by using'r'
as the format specification. (Even if you’re an experienced Matplotlib user, please check the syntax from the video, because I want us to use the Object Oriented Matplotlib syntax, for example, usingplt.subplots
andax.scatter
instead ofplt.scatter
.)The function should, as output, return the Matplotlib Figure object. (As I have it written, the function both shows the plot and also returns the figure object, so don’t worry if the figure shows up twice when you call the function. I’m not sure how to remove the extra display.)
Submission instructions#
Before evaluating the function, set a Matplotlib style using
plt.style.use(???)
and one of the choices fromplt.style.available
. I think it makes the most sense to do this outside of the function body.Run the function with
df_data
as the attachedsim_data.csv
(read it in usingpd.read_csv
as usual), withnum_rows=30
, withdegree=7
, and withstate
as your student id number (if you are in a group, just choose one of your student id numbers.)Save the resulting Matplotlib Figure object as a pickle file named wkst12-ans.pickle, as in the Worksheet 1 instructions, and upload that pickle file to Canvas.
If you want to check that your figure has been saved correctly, create a new notebook (or refresh this one) and run the following code (let me know if you see an easier approach):
import matplotlib.pyplot as plt
import pickle
plt.show()
with open("wkst12-ans.pickle", "rb") as f:
fig = pickle.load(f)
fig
Important for Midterm 2#
Not to be turned in but important for the next midterm. How do the predicted plots (the red curves) change as we use higher degrees? How do they change as we use smaller or larger subsets of the data (i.e., as we use fewer rows or more rows)?
Reminder#
Every group member needs to submit this on Canvas (even if you all submit the same file).
Submission#
Submit the pickle file on Canvas, as described in the instructions above.