Worksheet 18#
Names (3 maximum; use your full names): BLANK
The main part of this worksheet is based on an example from Jake VanderPlas’s book, Python Data Science Handbook.
Gathering the data#
Go to https://data.seattle.gov/ and find the “Fremont Bridge Bicycle Counter” dataset (not the one called “Timeline”, the plain one). Download the csv file for that dataset (click the “Export” button at the top right), and upload that csv file to this Deepnote project. Rename the csv file to “Fremont.csv”. (You can click the three dots to the right of the file name, to reach the option to rename it. Or just rename it on your computer before you upload the file.)
Read in the Fremont.csv file from Part 0 using pandas. Use the
parse_dates
keyword argument ofread_csv
to tell pandas to convert the values in the “Date” column to dates. (You will probably need to look up the documentation forread_csv
. I originally tried usingparse_dates=True
, but that didn’t work.)Drop the rows which contain missing values.
Keep only the first two columns and name the resulting DataFrame
df_pre
.
Rename the “Fremont Bridge Sidewalks, south of N 34th St” column to “Bikes”, using a command of the form
df_pre.rename({???: ???}, axis=???, inplace=???)
Using the
dt
accessor and two Boolean Series, define a new pandas DataFramedf_pre2
fromdf_pre
which contains only the rows indf_pre
from the year 2022 and from the hour 8:00am in the morning. Use.copy()
to ensure thatdf_pre2
is a new DataFrame.Round the “Date” column of
df_pre2
to the nearest date (i.e., lose the 8:00am part) by usingdt
,round
(documentation), and the nearest calendar day offset.
Check: the resulting DataFrame should have 365 rows and 2 columns.
Comment: At least in my dataset, the rows appear in a strange order, with August appearing at the beginning. This won’t affect us.
The weather data in the attached SeattleWeather.csv
file was downloaded from this website. (You don’t need to re-download it; just use the provided csv file in this Deepnote project.) You can find definitions for the columns in this csv file in the attached WeatherDoc.pdf
file.
Read in the contents of the SeattleWeather.csv file, again use
parse_dates
to convert the “DATE” column to a datetime data type, and name the resulting DataFramedf_weather
.
Use the pandas DataFrame method
merge
(documentation) withhow="inner"
to merge togetherdf_pre2
anddf_weather
on their columns containing dates. Use the keyword argumentsleft_on
andright_on
because the date columns are spelled differently in these two datasets. Name the resulting DataFramedf
.
Using list comprehension, make a list
wt_list
of all the column names indf
which begin with the letters"WT"
.
Add a column “IsWeekend” which contains
True
if the date is “Saturday” or “Sunday” and isFalse
otherwise. (Hint. Usedt
and theday_name
method, together withisin
.)
Add a column “Month” which contains the numeric month value.
Using
df[wt_list] = df[wt_list]...
and thefillna
method, replace all the missing values in these columns with0
.Remove the “PGTM” column from
df
, by using thedrop
method.Drop the rows with missing values.
Check: the resulting DataFrame should have 364 rows and 25 columns.
Machine Learning#
Define a list
features
containing all the numeric columns fromdf
except the “Bikes” column. Useis_numeric_dtype
frompandas.api.types
.
Decision tree#
Fit a Decision Tree regressor
reg
with a maximum of6
leaf nodes to the data, using the columns infeatures
as the input variables and using the “Bikes” column as the target.
Visualize this decision tree using the following code. (Feel free to change the
figsize
part if it helps.)
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree
fig = plt.figure(figsize=(20,10))
_ = plot_tree(reg, feature_names=reg.feature_names_in_, filled=True)
Using the above tree diagram, answer the following in a markdown cell.
List all the columns which are referenced in this decision tree.
Imagine a 55 degree Tuesday in August. What value would be predicted by our decision tree?
What about a 60 degree Tuesday in August?
Check the reported feature importances using the following code.
pd.Series(reg.feature_importances_, index=reg.feature_names_in_).sort_values(ascending=False)
Add a column “DT_Pred” (for “Decision Tree Prediction”) to
df
corresponding to the predicted values fromreg
for each row indf
.
Plot the data together (in red) with the true data (in black) using the following code.
c1 = alt.Chart(df).mark_line(color="black").encode(
x="Date",
y="Bikes"
).properties(
width=1000
)
c2 = alt.Chart(df).mark_line(color="red").encode(
x="Date",
y="DT_Pred"
)
c1+c2
Recall that we required our decision tree regressor to have at most 6 leaf nodes. How is that number 6 reflected in the above chart?
Random forest#
Fit a Random Forest regressor
rfr
to the data, using 200 estimators, each with a maximum of6
leaf nodes.
Check the reported feature importances, by adapting the code from the Decision Tree portion.
What do you think is the most noteworthy difference between these values and the values we found above for a single decision tree?
Add a column “RF_Pred” to
df
containing the predicted values fromrfr
.
Make an Altair chart as above, but using the “RF_Pred” column instead of the “DT_Pred” column.
What do you think is the most extreme difference between this chart and the above chart made using the decision tree?
Linear regression#
Fit a
LinearRegression
objectlin
to the data.
We don’t have a
feature_importances_
attribute for aLinearRegression
object, but we can at least check the signs of the coefficients. (The relative sizes of the numbers I think are not that meaningful. To meaninfully compare the sizes of the coefficients, we should first rescale the data.)
pd.Series(lin.coef_, index=lin.feature_names_in_)
Answer the following questions in a markdown cell.
What is the sign of the “PRCP” coefficient? Does it make sense? Why?
What is the sign of the “TMAX” coefficient? Does it make sense? Why?
Does the sign of the “IsWeekend” coefficient suggest people are biking more often to go to work/school, or for recreation? Briefly explain.
Add a column “Lin_Pred” to
df
containing the predicted values fromlin
.
Make an Altair chart as above, but using the “Lin_Pred” column for the predictions
Why does the above linear regression function look so different from a straight line? The graph certainly doesn’t look “linear”. How is that possible? Explain in a markdown cell.
* Imagine we have some Machine Learning model (not linear regression, just some model) whose predictions exactly match the true bike data. Why is that more likely to be a bad sign than a good sign?
Submission#
Reminder: everyone needs to make a submission on Canvas.
Reminder: include everyone’s full name at the top, after Names.
Using the
Share
button at the top right, enable public sharing, and enable Comment privileges. Then submit the created link on Canvas.