Worksheet 10
You are encouraged to work in groups of up to 3 total students, but each student should make their own submission on Canvas. (It’s fine for everyone in the group to have the same upload.)
Put the full names of everyone in your group (even if you’re working alone) here. (This makes grading easier.)
Names:
Part 1: Predicting MPG using one feature
Load the mpg dataset from Seaborn and store it with the variable name `df`.
Drop the rows that contain missing values from the dataset, using `df.dropna(axis=???)`. Save the resulting DataFrame also as `df`, either using `df = df.dropna(axis=???)` or by using `df.dropna(axis=???, inplace=???)` (no `df =` in this second one).
Using Altair, make a scatter plot with “horsepower” along the x-axis and with “mpg” along the y-axis.
Based on this chart, answer the following questions in a markdown cell.
In a line of best fit \(\text{mpg} \approx m \cdot \text{horsepower} + b\), based on the above chart, what would you estimate for \(m\) and \(b\)? (Just a rough estimate is fine.)
In particular, does the chart suggest \(m\) is positive or negative? Does that make sense intuitively, from the real-world meaning of this data?
Does a straight line seem to be a good fit for this data? Write about one sentence giving your opinion. (I don’t think there is a clear correct answer for this, just tell me how it seems to you.)
Be sure you are answering in a markdown cell, not in a Python comment.
Find values of \(m\) and \(b\) (in the notation above) using the `LinearRegression` class from `sklearn.linear_model`. Assign these values to variables `m` and `b`.
Verify that `m` is actually a float, not a NumPy array, using `type`. If not, then fix it.
Hint. If `arr` is a NumPy array containing just a single element, you can access that element using the `item` method. Or you could just use `arr[0]`.
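The two options in the hint can be compared on a toy array. This is only a sketch: the array `arr` and the value 2.5 are made up for illustration and have nothing to do with the mpg data. It shows what `type` reports before and after extracting the single element.

```python
import numpy as np

# A made-up length-1 array, standing in for any single-element NumPy array.
arr = np.array([2.5])

# Option 1: the item method returns a plain Python float.
via_item = arr.item()

# Option 2: indexing returns a NumPy scalar rather than an array.
via_index = arr[0]

print(type(arr))        # the array itself
print(type(via_item))   # built-in float
print(type(via_index))  # a NumPy floating-point scalar type
```

Either way the result is no longer an array. If a built-in `float` is specifically wanted, `item` gives that directly.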
Part 2: Predicting MPG using multiple features
Now we will use four input variables rather than just one.
Define a list `pred_cols` containing the names of the four input columns we will use.
pred_cols = ["horsepower", "weight", "model_year", "cylinders"]
Fit a new `LinearRegression` object using these four columns as input features and again using “mpg” as the target value. Store the resulting `LinearRegression` object with the variable name `reg2`.
Evaluate `reg2.coef_`. It is a little annoying that we can’t immediately tell which number corresponds to which variable.
Here is a nice way to group the feature names with the values in a pandas Series. Execute the following code, after filling in an appropriate value for the `index`.
ser = pd.Series(reg2.coef_, index=???)
I’m expecting you to use `pred_cols`, but another option that might be more reliable is to use `reg2.feature_names_in_`.
Examine the pandas Series you made above and answer the following questions in a markdown cell.
Only one of the four coefficients is positive. Explain why it makes sense that this particular coefficient is positive, in terms of the real-world meaning of this data.
Imagine a car from 1999 weighing 3800 pounds, with 200 horsepower, and with 4 cylinders. What would our model estimate for the mpg for this car? (Hint 1. Notice that a year like `1974` is not written as `1974` in the DataFrame, so don’t use `1999` as one of your inputs. Hint 2. Don’t forget the bias/intercept.)
The horsepower coefficient you computed is quite different from the one computed in Part 1. Why are these numbers so different? (Idea: In Part 1, we are learning how mpg changes when horsepower increases. In this part, we are learning how mpg changes when horsepower increases and these other values stay fixed. Why would you expect that to make a difference?)
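The idea of labeling coefficients with a pandas Series, together with the reminder about the intercept, can be sketched with made-up numbers. Everything below is hypothetical: the names `feature_a`, `feature_b`, `feature_c`, the coefficient values, and the intercept are invented for illustration and are not taken from the mpg data or from any fitted model.

```python
import numpy as np
import pandas as pd

# Hypothetical coefficients and intercept, standing in for a fitted model.
names = ["feature_a", "feature_b", "feature_c"]
coef_values = np.array([2.0, -1.0, 0.5])
intercept = 10.0

# Pair each coefficient with its feature name.
coefs = pd.Series(coef_values, index=names)

# Hypothetical input values, deliberately listed in a different order.
inputs = pd.Series({"feature_c": 4.0, "feature_a": 1.0, "feature_b": 3.0})

# Multiplication aligns on the labels, not on position, so the order
# of the inputs does not matter. The intercept is added at the end.
prediction = (coefs * inputs).sum() + intercept
print(coefs)
print(prediction)
```

Because the Series carries its labels, each coefficient is multiplied by the matching input even when the two are listed in different orders.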
Part 3: Estimating MPG for a hypothetical car
We will check if your computation above was correct for the 1999 car.
Define `ser` to be a copy of the top row of `df`. (In other words, get the top row of `df` however you normally would, and then call `.copy()` at the end of your code. This way, you can make changes to this Series without affecting `df`.)
Define a length-4 Python dictionary `dct` with keys “weight”, “cylinders”, “horsepower”, and “model_year” and with values matching what was used in the “imagine a car” question above. For example,
dct = {"weight": 3800, ...}
Call the `update` method of `ser` and pass it `dct` as an argument.
Evaluate `ser`. Notice that the values have been changed to match our hypothetical car.
Make a length-1 list whose only element is the pandas Series `ser`.
Pass that length-1 list to `pd.DataFrame` to construct a one-row DataFrame. Assign the result to the variable `df2`.
Evaluate `df2`. You should see a one-row DataFrame, with values matching those of our hypothetical car (and some leftover columns from the original row).
Use `reg2.predict` to find the predicted mpg corresponding to this hypothetical car.
Hint. Remember that we are only using the columns from the list `pred_cols`.
What error message do you get if you try to do the same thing but using `ser[pred_cols]` instead of `df2[pred_cols]`?
What does the error message mean by “2D array” vs “1D array” in the context of these pandas objects? Answer in a markdown cell.
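The copy, update, and one-row DataFrame steps in this part can be tried first on a small toy table. This is a sketch under stated assumptions: the DataFrame `toy`, its column names, and all of its values are invented for illustration, and the variable names `row` and `one_row` are hypothetical stand-ins rather than the names the worksheet asks for.

```python
import pandas as pd

# A made-up two-row table with one non-numeric column.
toy = pd.DataFrame({
    "weight": [3504.0, 3693.0],
    "horsepower": [130.0, 165.0],
    "name": ["car a", "car b"],
})

# Copy the top row so that later changes do not touch the original table.
row = toy.iloc[0].copy()

# Overwrite only the entries whose labels appear in the dictionary.
row.update({"weight": 3800.0, "horsepower": 200.0})

# Wrapping the Series in a length-1 list gives a DataFrame with one row.
one_row = pd.DataFrame([row])

print(row)
print(one_row)
print(toy)  # the original table still holds its original values
```

The "name" entry is left over from the original row, since the dictionary did not mention it.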
Submission
Reminder: everyone needs to make a submission on Canvas.
Reminder: Use markdown cells for text explanations, not Python comments. The only exception is if it is a very brief explanation at the top of a code cell.
Reminder: include everyone’s full name at the top, after Names.
Using the Share button at the top right, enable public sharing, and enable Comment privileges. Then submit the created link on Canvas.