Worksheet 18

Worksheet 18#

Names (3 maximum; use your full names): BLANK

The main part of this worksheet is based on an example from Jake VanderPlas’s book, Python Data Science Handbook.

Gathering the data#

Go to https://data.seattle.gov/ and find the “Fremont Bridge Bicycle Counter” dataset (not the one called “Timeline”, the plain one). Download the csv file for that dataset (click the “Export” button at the top right), and upload that csv file to this Deepnote project. Rename the csv file to “Fremont.csv”. (You can click the three dots to the right of the file name, to reach the option to rename it. Or just rename it on your computer before you upload the file.)

Read in the Fremont.csv file from Part 0 using pandas. Use the parse_dates keyword argument of read_csv to tell pandas to convert the values in the “Date” column to dates. (You will probably need to look up the documentation for read_csv. I originally tried using parse_dates=True, but that didn’t work.)
Drop the rows which contain missing values.
Keep only the first two columns and name the resulting DataFrame df_pre.

Rename the “Fremont Bridge Sidewalks, south of N 34th St” column to “Bikes”, using a command of the form

df_pre.rename({???: ???}, axis=???, inplace=???)

Using the dt accessor and two Boolean Series, define a new pandas DataFrame df_pre2 from df_pre which contains only the rows in df_pre from the year 2022 and from the hour 8:00am in the morning. Use .copy() to ensure that df_pre2 is a new DataFrame.
Round the “Date” column of df_pre2 to the nearest date (i.e., lose the 8:00am part) by using dt, round (documentation), and the nearest calendar day offset.

Check: the resulting DataFrame should have 365 rows and 2 columns.

Comment: At least in my dataset, the rows appear in a strange order, with August appearing at the beginning. This won’t affect us.

The weather data in the attached SeattleWeather.csv file was downloaded from this website. (You don’t need to re-download it; just use the provided csv file in this Deepnote project.) You can find definitions for the columns in this csv file in the attached WeatherDoc.pdf file.

Read in the contents of the SeattleWeather.csv file, again use parse_dates to convert the “DATE” column to a datetime data type, and name the resulting DataFrame df_weather.

Use the pandas DataFrame method merge (documentation) with how="inner" to merge together df_pre2 and df_weather on their columns containing dates. Use the keyword arguments left_on and right_on because the date columns are spelled differently in these two datasets. Name the resulting DataFrame df.

Using list comprehension, make a list wt_list of all the column names in df which begin with the letters "WT".

Add a column “IsWeekend” which contains True if the date is “Saturday” or “Sunday” and is False otherwise. (Hint. Use dt and the day_name method, together with isin.)

Add a column “Month” which contains the numeric month value.

Using df[wt_list] = df[wt_list]... and the fillna method, replace all the missing values in these columns with 0.
Remove the “PGTM” column from df, by using the drop method.
Drop the rows with missing values.

Check: the resulting DataFrame should have 364 rows and 25 columns.

Machine Learning#

Define a list features containing all the numeric columns from df except the “Bikes” column. Use is_numeric_dtype from pandas.api.types.

Decision tree#

Fit a Decision Tree regressor reg with a maximum of 6 leaf nodes to the data, using the columns in features as the input variables and using the “Bikes” column as the target.

Visualize this decision tree using the following code. (Feel free to change the figsize part if it helps.)

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

fig = plt.figure(figsize=(20,10))
_ = plot_tree(reg, feature_names=reg.feature_names_in_, filled=True)

Using the above tree diagram, answer the following in a markdown cell.

List all the columns which are referenced in this decision tree.
Imagine a 55 degree Tuesday in August. What value would be predicted by our decision tree?
What about a 60 degree Tuesday in August?

Check the reported feature importances using the following code.

pd.Series(reg.feature_importances_, index=reg.feature_names_in_).sort_values(ascending=False)

Add a column “DT_Pred” (for “Decision Tree Prediction”) to df corresponding to the predicted values from reg for each row in df.

Plot the data together (in red) with the true data (in black) using the following code.

c1 = alt.Chart(df).mark_line(color="black").encode(
    x="Date",
    y="Bikes"
).properties(
    width=1000
)

c2 = alt.Chart(df).mark_line(color="red").encode(
    x="Date",
    y="DT_Pred"
)

c1+c2

Recall that we required our decision tree regressor to have at most 6 leaf nodes. How is that number 6 reflected in the above chart?

Random forest#

Fit a Random Forest regressor rfr to the data, using 200 estimators, each with a maximum of 6 leaf nodes.

Check the reported feature importances, by adapting the code from the Decision Tree portion.

What do you think is the most noteworthy difference between these values and the values we found above for a single decision tree?

Add a column “RF_Pred” to df containing the predicted values from rfr.

Make an Altair chart as above, but using the “RF_Pred” column instead of the “DT_Pred” column.

What do you think is the most extreme difference between this chart and the above chart made using the decision tree?

Linear regression#

Fit a LinearRegression object lin to the data.

We don’t have a feature_importances_ attribute for a LinearRegression object, but we can at least check the signs of the coefficients. (The relative sizes of the numbers I think are not that meaningful. To meaninfully compare the sizes of the coefficients, we should first rescale the data.)

pd.Series(lin.coef_, index=lin.feature_names_in_)

Answer the following questions in a markdown cell.

What is the sign of the “PRCP” coefficient? Does it make sense? Why?
What is the sign of the “TMAX” coefficient? Does it make sense? Why?
Does the sign of the “IsWeekend” coefficient suggest people are biking more often to go to work/school, or for recreation? Briefly explain.

Add a column “Lin_Pred” to df containing the predicted values from lin.

Make an Altair chart as above, but using the “Lin_Pred” column for the predictions

Why does the above linear regression function look so different from a straight line? The graph certainly doesn’t look “linear”. How is that possible? Explain in a markdown cell.

* Imagine we have some Machine Learning model (not linear regression, just some model) whose predictions exactly match the true bike data.  Why is that more likely to be a bad sign than a good sign?

Submission#

Reminder: everyone needs to make a submission on Canvas.
Reminder: include everyone’s full name at the top, after Names.
Using the Share button at the top right, enable public sharing, and enable Comment privileges. Then submit the created link on Canvas.

Created in Deepnote