Worksheet 10

You are encouraged to work in groups of up to 3 total students, but each student should make their own submission on Canvas. (It’s fine for everyone in the group to have the same upload.)

Put the full names of everyone in your group (even if you’re working alone) here. (This makes grading easier.)

Names:

Part 1: Predicting MPG using one feature

Load the mpg dataset from Seaborn and store it with the variable name df.

Drop the rows that contain missing values from the dataset, using df.dropna(axis=???). Store the result back in df, either by assigning df = df.dropna(axis=???) or by calling df.dropna(axis=???, inplace=???) on its own (no df = in this second version).
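To see what the axis and inplace arguments do before applying them to df, here is a small sketch on a made-up DataFrame. The names toy, rows_dropped, cols_dropped, toy_copy, and returned are hypothetical and are not part of the worksheet.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical, not the mpg dataset): one missing value in column "a".
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})

rows_dropped = toy.dropna(axis=0)  # removes the row that contains NaN
cols_dropped = toy.dropna(axis=1)  # removes the column that contains NaN

# With inplace=True the method changes the object itself and returns None,
# which is why the inplace version is written without an assignment.
toy_copy = toy.copy()
returned = toy_copy.dropna(axis=0, inplace=True)

print(rows_dropped.shape, cols_dropped.shape, returned)
```

Comparing the two shapes should make it clear which axis value removes rows and which removes columns.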

Using Altair, make a scatter plot with “horsepower” along the x-axis and with “mpg” along the y-axis.

Based on this chart, answer the following questions in a markdown cell.

In a line of best fit \(\text{mpg} \approx m \cdot \text{horsepower} + b\), based on the above chart, what would you estimate for \(m\) and \(b\)? (Just a rough estimate is fine.)
In particular, does the chart suggest \(m\) is positive or negative? Does that make sense intuitively, from the real-world meaning of this data?
Does a straight line seem to be a good fit for this data? Write about one sentence giving your opinion. (I don’t think there is a clear correct answer for this, just tell me how it seems to you.)

Be sure you are answering in a markdown cell, not in a Python comment.

Find values of \(m\) and \(b\) (in the notation above) using the LinearRegression class from sklearn.linear_model. Assign these values to variables m and b.

Use type to verify that m is actually a float, not a NumPy array. If it is an array, fix it.

Hint. If arr is a NumPy array containing a single element, you can access that element using the item method. Or you could just use arr[0].
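Here is a minimal sketch of this workflow on synthetic data, not the mpg dataset. The column names x and y and the variables toy, reg, slope, and intercept are made up for illustration. The data lie exactly on the line y = 40 - 0.15x, so the fitted values should recover those two numbers.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data lying exactly on a line (an assumption for illustration).
rng = np.random.default_rng(0)
toy = pd.DataFrame({"x": rng.uniform(50, 200, size=100)})
toy["y"] = 40 - 0.15 * toy["x"]

reg = LinearRegression()
# The input must be two-dimensional, hence the double brackets.
reg.fit(toy[["x"]], toy["y"])

print(type(reg.coef_))        # a NumPy array, even with only one feature
slope = reg.coef_.item()      # extract the single element as a Python float
intercept = reg.intercept_
print(type(slope), slope, intercept)
```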

Part 2: Predicting MPG using multiple features

Now we will use four input variables rather than just one.

Define a list pred_cols containing the names of the four input columns we will use.

pred_cols = ["horsepower", "weight", "model_year", "cylinders"]

Fit a new LinearRegression object using these four columns as input features and again using “mpg” as the target value. Store the resulting LinearRegression object with the variable name reg2.

Evaluate reg2.coef_. It is a little annoying that we can’t immediately tell which number corresponds to which variable.

Here is a nice way to group the feature names with the values in a pandas Series. Execute the following code, after filling in an appropriate value for the index.

ser = pd.Series(reg2.coef_, index=???)

I’m expecting you to use pred_cols, but another option that might be more reliable is to use reg2.feature_names_in_.

Examine the pandas Series you made above and answer the following questions in a markdown cell.

Only one of the four coefficients is positive. Explain why it makes sense that particular coefficient is positive, in terms of the real-world meaning of this data.
Imagine a car from 1999 weighing 3800 pounds, with 200 horsepower, and with 4 cylinders. What would our model estimate for the mpg for this car? (Hint 1. Notice that a year like 1974 is not written as 1974 in the DataFrame, so don’t use 1999 as one of your inputs. Hint 2. Don’t forget the bias/intercept.)
The horsepower coefficient you computed is quite different from the one computed in Part 1. Why are these numbers so different? (Idea: In Part 1, we are learning how mpg changes when horsepower increases. In this part, we are learning how mpg changes when horsepower increases and these other values stay fixed. Why would you expect that to make a difference?)
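For the by-hand estimate, the model's prediction is the sum of each coefficient times its input value, plus the intercept. Here is a sketch of that arithmetic on synthetic data, checked against predict. The features u and v and all variable names are made up; this is not the mpg model.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration): y = 2u - 3v + 1 exactly.
rng = np.random.default_rng(1)
toy = pd.DataFrame({"u": rng.normal(size=50), "v": rng.normal(size=50)})
toy["y"] = 2 * toy["u"] - 3 * toy["v"] + 1

cols = ["u", "v"]
reg = LinearRegression().fit(toy[cols], toy["y"])

# A hypothetical new input point, listed in the same order as cols.
new_point = pd.Series({"u": 0.5, "v": -1.0})

# Coefficient times value for each feature, summed, plus the intercept.
by_hand = (reg.coef_ * new_point[cols].to_numpy()).sum() + reg.intercept_

# The same prediction made by the model from a one-row DataFrame.
by_model = reg.predict(pd.DataFrame([new_point])[cols])[0]

print(by_hand, by_model)
```

Leaving out reg.intercept_ would shift the by-hand answer by exactly the intercept, which is an easy mistake to make.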

Part 3: Estimating MPG for a hypothetical car

We will check if your computation above was correct for the 1999 car.

Define ser to be a copy of the top row of df. (In other words, get the top row of df however you normally would, and then call .copy() at the end of your code. This way, you can make changes to this Series without affecting df.)

Define a length-4 Python dictionary dct with keys “weight”, “cylinders”, “horsepower”, and “model_year” and with values matching what was used in the “imagine a car” question above. For example,

dct = {"weight": 3800, ...}

Call the update method of ser and pass it dct as an argument.

Evaluate ser. Notice that the values have been changed to match our hypothetical car.

Make a length-1 list whose only element is the pandas Series ser.

Pass that length-1 list to pd.DataFrame to construct a one row DataFrame. Assign the result to the variable df2.

Evaluate df2. You should see a one-row DataFrame, with values matching those of our hypothetical car (and some leftover columns from the original row).
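The copy, update, and wrap-in-a-list steps can be sketched on a small made-up DataFrame as follows. The names toy, row, and one_row and the values are hypothetical stand-ins for df, ser, and df2.

```python
import pandas as pd

# Toy stand-in for df (hypothetical values).
toy = pd.DataFrame({
    "name": ["car a", "car b"],
    "weight": [3500, 3700],
    "cylinders": [8, 8],
})

# Copy the top row so that later changes do not affect toy.
row = toy.iloc[0].copy()

# update overwrites the entries whose labels appear in the dictionary.
row.update({"weight": 3800, "cylinders": 4})

# A list containing one Series becomes a DataFrame with one row.
one_row = pd.DataFrame([row])

print(row)
print(one_row.shape)
print(toy.loc[0, "weight"])  # the original DataFrame is unchanged
```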

Use reg2.predict to find the predicted mpg corresponding to this hypothetical car.

Hint. Remember that we are only using the columns from the list pred_cols.

What error message do you get if you try to do the same thing but using ser[pred_cols] instead of df2[pred_cols]?

What does the error message mean by “2D array” vs “1D array” in the context of these pandas objects? Answer in a markdown cell.
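One way to explore the difference is to compare the ndim and shape attributes of the two kinds of objects. Here is a sketch using a made-up Series and the one-row DataFrame built from it. The names row, one_row, and cols are hypothetical.

```python
import pandas as pd

# Hypothetical values standing in for a single row of data.
row = pd.Series({"weight": 3800, "cylinders": 4, "name": "car a"})
one_row = pd.DataFrame([row])
cols = ["weight", "cylinders"]

# Selecting several labels from a Series gives another Series.
print(type(row[cols]), row[cols].ndim, row[cols].shape)

# Selecting several columns from a DataFrame gives another DataFrame.
print(type(one_row[cols]), one_row[cols].ndim, one_row[cols].shape)
```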

Submission

Reminder: everyone needs to make a submission on Canvas.
Reminder: Use markdown cells for text explanations, not Python comments. The only exception is if it is a very brief explanation at the top of a code cell.
Reminder: include everyone’s full name at the top, after Names.
Using the Share button at the top right, enable public sharing, and enable Comment privileges. Then submit the created link on Canvas.
