Worksheet 18#
Names (3 maximum; use your full names): BLANK
The main part of this worksheet is based on an example from Jake VanderPlas’s book, Python Data Science Handbook.
Gathering the data#
- Go to https://data.seattle.gov/ and find the “Fremont Bridge Bicycle Counter” dataset (not the one called “Timeline”, the plain one). Download the csv file for that dataset (click the “Export” button at the top right), and upload that csv file to this Deepnote project. Rename the csv file to “Fremont.csv”. (You can click the three dots to the right of the file name, to reach the option to rename it. Or just rename it on your computer before you upload the file.) 
- Read in the Fremont.csv file from Part 0 using pandas. Use the - parse_dateskeyword argument of- read_csvto tell pandas to convert the values in the “Date” column to dates. (You will probably need to look up the documentation for- read_csv. I originally tried using- parse_dates=True, but that didn’t work.)
- Drop the rows which contain missing values. 
- Keep only the first two columns and name the resulting DataFrame - df_pre.
- Rename the “Fremont Bridge Sidewalks, south of N 34th St” column to “Bikes”, using a command of the form 
df_pre.rename({???: ???}, axis=???, inplace=???)
- Using the - dtaccessor and two Boolean Series, define a new pandas DataFrame- df_pre2from- df_prewhich contains only the rows in- df_prefrom the year 2022 and from the hour 8:00am in the morning. Use- .copy()to ensure that- df_pre2is a new DataFrame.
- Round the “Date” column of - df_pre2to the nearest date (i.e., lose the 8:00am part) by using- dt,- round(documentation), and the nearest calendar day offset.
Check: the resulting DataFrame should have 365 rows and 2 columns.
Comment: At least in my dataset, the rows appear in a strange order, with August appearing at the beginning. This won’t affect us.
The weather data in the attached SeattleWeather.csv file was downloaded from this website.  (You don’t need to re-download it; just use the provided csv file in this Deepnote project.)  You can find definitions for the columns in this csv file in the attached WeatherDoc.pdf file.
- Read in the contents of the SeattleWeather.csv file, again use - parse_datesto convert the “DATE” column to a datetime data type, and name the resulting DataFrame- df_weather.
- Use the pandas DataFrame method - merge(documentation) with- how="inner"to merge together- df_pre2and- df_weatheron their columns containing dates. Use the keyword arguments- left_onand- right_onbecause the date columns are spelled differently in these two datasets. Name the resulting DataFrame- df.
- Using list comprehension, make a list - wt_listof all the column names in- dfwhich begin with the letters- "WT".
- Add a column “IsWeekend” which contains - Trueif the date is “Saturday” or “Sunday” and is- Falseotherwise. (Hint. Use- dtand the- day_namemethod, together with- isin.)
- Add a column “Month” which contains the numeric month value. 
- Using - df[wt_list] = df[wt_list]...and the- fillnamethod, replace all the missing values in these columns with- 0.
- Remove the “PGTM” column from - df, by using the- dropmethod.
- Drop the rows with missing values. 
Check: the resulting DataFrame should have 364 rows and 25 columns.
Machine Learning#
- Define a list - featurescontaining all the numeric columns from- dfexcept the “Bikes” column. Use- is_numeric_dtypefrom- pandas.api.types.
Decision tree#
- Fit a Decision Tree regressor - regwith a maximum of- 6leaf nodes to the data, using the columns in- featuresas the input variables and using the “Bikes” column as the target.
- Visualize this decision tree using the following code. (Feel free to change the - figsizepart if it helps.)
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree
fig = plt.figure(figsize=(20,10))
_ = plot_tree(reg, feature_names=reg.feature_names_in_, filled=True)
Using the above tree diagram, answer the following in a markdown cell.
- List all the columns which are referenced in this decision tree. 
- Imagine a 55 degree Tuesday in August. What value would be predicted by our decision tree? 
- What about a 60 degree Tuesday in August? 
- Check the reported feature importances using the following code. 
pd.Series(reg.feature_importances_, index=reg.feature_names_in_).sort_values(ascending=False)
- Add a column “DT_Pred” (for “Decision Tree Prediction”) to - dfcorresponding to the predicted values from- regfor each row in- df.
- Plot the data together (in red) with the true data (in black) using the following code. 
c1 = alt.Chart(df).mark_line(color="black").encode(
    x="Date",
    y="Bikes"
).properties(
    width=1000
)
c2 = alt.Chart(df).mark_line(color="red").encode(
    x="Date",
    y="DT_Pred"
)
c1+c2
- Recall that we required our decision tree regressor to have at most 6 leaf nodes. How is that number 6 reflected in the above chart? 
Random forest#
- Fit a Random Forest regressor - rfrto the data, using 200 estimators, each with a maximum of- 6leaf nodes.
- Check the reported feature importances, by adapting the code from the Decision Tree portion. 
- What do you think is the most noteworthy difference between these values and the values we found above for a single decision tree? 
- Add a column “RF_Pred” to - dfcontaining the predicted values from- rfr.
- Make an Altair chart as above, but using the “RF_Pred” column instead of the “DT_Pred” column. 
- What do you think is the most extreme difference between this chart and the above chart made using the decision tree? 
Linear regression#
- Fit a - LinearRegressionobject- linto the data.
- We don’t have a - feature_importances_attribute for a- LinearRegressionobject, but we can at least check the signs of the coefficients. (The relative sizes of the numbers I think are not that meaningful. To meaninfully compare the sizes of the coefficients, we should first rescale the data.)
pd.Series(lin.coef_, index=lin.feature_names_in_)
Answer the following questions in a markdown cell.
- What is the sign of the “PRCP” coefficient? Does it make sense? Why? 
- What is the sign of the “TMAX” coefficient? Does it make sense? Why? 
- Does the sign of the “IsWeekend” coefficient suggest people are biking more often to go to work/school, or for recreation? Briefly explain. 
- Add a column “Lin_Pred” to - dfcontaining the predicted values from- lin.
- Make an Altair chart as above, but using the “Lin_Pred” column for the predictions 
- Why does the above linear regression function look so different from a straight line? The graph certainly doesn’t look “linear”. How is that possible? Explain in a markdown cell. 
* Imagine we have some Machine Learning model (not linear regression, just some model) whose predictions exactly match the true bike data.  Why is that more likely to be a bad sign than a good sign?
Submission#
- Reminder: everyone needs to make a submission on Canvas. 
- Reminder: include everyone’s full name at the top, after Names. 
- Using the - Sharebutton at the top right, enable public sharing, and enable Comment privileges. Then submit the created link on Canvas.
