Homework 5

Author: BLANK

Collaborators: BLANK

The main part of this week’s homework is based on an example from Jake VanderPlas’s book, Python Data Science Handbook.

Part 0 - Downloading the data

Go to https://data.seattle.gov/ and find the “Fremont Bridge Bicycle Counter” dataset (not the one called “Timeline”, the plain one). Download the csv file for that dataset (click the “Export” button at the top right), and upload that csv file to this Deepnote project. Rename the csv file to “Fremont.csv”. (You can click the three dots to the right of the file name, to reach the option to rename it. Or just rename it on your computer before you upload the file.)

Part 1 - Worksheet on clustering (Optional)

Part 1 is optional and worth up to 4 bonus points in the Homework category.

Complete either the Thursday or Friday worksheet (your choice) from Week 5 and upload it into this same project. (So in your final submission, there should be 4 files in this project: Fremont.csv, SeattleWeather.csv, Homework5.ipynb, and one of the two completed worksheets.)

Part 2 - Linear regression with Seattle bicycle data

Question 1

  • Read in the Fremont.csv file from Part 0, drop the rows which contain missing values, keep only the first two columns, and name the resulting DataFrame df_pre.

  • Rename the “Fremont Bridge Total” column to “Bikes”.

  • Convert the “Date” column to a datetime data type by using pd.to_datetime. (The exact dtype will be displayed as datetime64[ns]. This step may take about 30 seconds.)

  • Using the dt accessor and two Boolean Series, define a new pandas DataFrame df_pre2 which contains only the rows of df_pre from the year 2022 and from the 8:00am hour. Use .copy() to ensure that df_pre2 is a new DataFrame.

  • Round the “Date” column to the nearest date (i.e., drop the 8:00am part) by using the following code: df_pre2["Date"] = df_pre2["Date"].dt.round("D")

  • Check your answer: df_pre2 should have 120 rows and 2 columns.
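
The filtering and rounding steps above can be sketched on a small made-up table. This is only an illustration: the DataFrame toy, its four rows, and the timestamp format string are assumptions chosen for the example, not the contents of Fremont.csv.

```python
import pandas as pd

# Hypothetical toy data; the real file has many more rows.
toy = pd.DataFrame({
    "Date": ["12/31/2021 08:00:00 AM", "01/01/2022 07:00:00 AM",
             "01/01/2022 08:00:00 AM", "01/02/2022 08:00:00 AM"],
    "Bikes": [5, 3, 12, 20],
})

# Passing an explicit format (assumed here) usually makes parsing faster.
toy["Date"] = pd.to_datetime(toy["Date"], format="%m/%d/%Y %I:%M:%S %p")

# Two Boolean Series built with the dt accessor, combined with &.
is_2022 = toy["Date"].dt.year == 2022
is_8am = toy["Date"].dt.hour == 8
toy2 = toy[is_2022 & is_8am].copy()

# 8:00am is closer to the preceding midnight, so rounding keeps the same day.
toy2["Date"] = toy2["Date"].dt.round("D")
print(toy2)
```

Only the two rows that are both in 2022 and at 8:00am survive, and their timestamps become plain dates.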

Question 2

The weather data in the SeattleWeather.csv file was downloaded a few days ago from this website. (You don’t need to re-download it; just use the provided csv file in this Deepnote project.) The “PRCP” column in this csv file indicates the amount of precipitation that fell on that day.

  • Read in the contents of the SeattleWeather.csv file, drop the rows with missing data, and name the result df_weather.

  • Rename the “DATE” column to “Date”.

  • Convert the “Date” column to a datetime data type.

  • Keep only the “Date” and “PRCP” columns in df_weather, for example, by using df_weather = df_weather[???].copy().

  • Check your answer: the resulting DataFrame should have 116 rows and 2 columns, and the dtypes should be datetime64[ns] and float64.
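
The same sequence of steps (drop missing rows, rename, convert, keep a subset of columns) can be sketched on made-up data. The column names DAY, TEMP and NOTE below are hypothetical and deliberately different from the ones in SeattleWeather.csv.

```python
import pandas as pd

# Hypothetical toy data with one missing value.
toy_weather = pd.DataFrame({
    "DAY": ["2022-01-01", "2022-01-02", "2022-01-03"],
    "TEMP": [41.0, None, 38.5],
    "NOTE": ["a", "b", "c"],
})

toy_weather = toy_weather.dropna()                        # drop rows with missing data
toy_weather = toy_weather.rename({"DAY": "Day"}, axis=1)  # rename one column
toy_weather["Day"] = pd.to_datetime(toy_weather["Day"])   # strings -> datetimes

# Indexing with a list of column names returns a DataFrame;
# .copy() makes it independent of the original.
toy_weather = toy_weather[["Day", "TEMP"]].copy()
print(toy_weather.dtypes)
```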

Question 3

  • Using merge with how="inner", merge together df_pre2 and df_weather on their “Date” columns. (See this week’s videos for information on merge.) Name the resulting DataFrame df.

  • The resulting DataFrame df should have 116 rows and 3 columns.
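
An inner merge keeps only the keys that appear in both DataFrames, which is why the merged result can have fewer rows than one of its inputs. A minimal sketch on two made-up tables (left and right are hypothetical names with toy values):

```python
import pandas as pd

left = pd.DataFrame({
    "Date": pd.to_datetime(["2022-01-01", "2022-01-02", "2022-01-03"]),
    "Bikes": [10, 20, 30],
})
right = pd.DataFrame({
    "Date": pd.to_datetime(["2022-01-02", "2022-01-03", "2022-01-04"]),
    "PRCP": [0.5, 0.0, 0.2],
})

# Only the dates present in both tables (Jan 2 and Jan 3) are kept.
merged = left.merge(right, on="Date", how="inner")
print(merged)
```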

Question 4

  • Midterm 1 Review: pd.to_datetime("2022-05-02").day_name() is “Monday”. Using this idea, list comprehension, and f-strings, make the length 7 list ["Monday", "Tuesday", ..., "Saturday", "Sunday"] and save it with the variable name days.

  • Add 7 columns to df with the names “Monday”, …, “Sunday”, and fill in all the values with 0. (This is easier than it sounds. Just use df[days] = 0.)

  • df should now have 10 columns.
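
The same pattern works for month names, which gives a sketch of the technique without using day names. The list months and the DataFrame toy below are made up for illustration.

```python
import pandas as pd

# f-string inside a list comprehension: "2022-01-01", "2022-02-01", "2022-03-01".
months = [pd.to_datetime(f"2022-{m:02d}-01").month_name() for m in range(1, 4)]
print(months)

# Assigning a single value to a list of new column names creates all of them.
toy = pd.DataFrame({"x": [1, 2]})
toy[months] = 0
print(toy)
```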

Question 5

  • Add a new column called “DayName” to df that contains the day name for that row. For example, January 1st of this year was a Saturday, so the initial value in the “DayName” column should be “Saturday”. (Suggestion: use df["DayName"] = df["Date"].map(???).)

  • Add a new column called “Month” to df that contains the numerical month number. Use the dt accessor again.
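
As an analogous sketch with different fields, the following uses map with a function for one new column and the dt accessor for another. The column names MonthName and DayOfMonth are hypothetical and are not part of the assignment.

```python
import pandas as pd

toy = pd.DataFrame({"Date": pd.to_datetime(["2022-01-15", "2022-02-20"])})

# map applies the function to each Timestamp in the column.
toy["MonthName"] = toy["Date"].map(lambda ts: ts.month_name())

# The dt accessor exposes the same kind of information for the whole column at once.
toy["DayOfMonth"] = toy["Date"].dt.day
print(toy)
```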

Question 6

  • If we want to know which rows correspond to “Monday”, we can use df["DayName"] == "Monday".

  • If we want to set the value in the “Monday” column of those rows to 1, we can use df.loc[df["DayName"] == "Monday", "Monday"] = 1.

  • Using that idea, a for loop, and the days list from above, set the “Monday” column of all “Monday” rows to 1, set the “Tuesday” column of all “Tuesday” rows to 1, etc.

  • Check your answer. If you evaluate df[days].sum(axis=0), you should see that there are 17 Mondays and 16 Thursdays. (One Sunday value seems to be missing.)
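
The loop-plus-loc idea can be sketched on a made-up categorical column. The Color column and the colors list are hypothetical stand-ins for “DayName” and days.

```python
import pandas as pd

toy = pd.DataFrame({"Color": ["red", "blue", "red", "green"]})
colors = ["red", "green", "blue"]
toy[colors] = 0  # start with all indicator columns equal to 0

# For each category, set its own column to 1 in the matching rows.
for c in colors:
    toy.loc[toy["Color"] == c, c] = 1

# Column sums count how often each category occurs.
print(toy[colors].sum(axis=0))
```

Each row ends up with exactly one 1 among the indicator columns, so the row sums are all 1.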

Question 7

  • Create a scikit-learn LinearRegression object reg. When you create reg, specify the keyword argument fit_intercept=False. (For this particular data, allowing an intercept value gives no extra flexibility.)

  • Define a length 9 list cols containing “PRCP”, “Month”, and all the days from the days list.

  • Fit reg using cols for the input variables, and using “Bikes” for the output variable.

  • Add a “Pred” column to df containing the reg.predict values.
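
The remark about fit_intercept=False can be checked numerically. The sketch below uses NumPy least squares rather than scikit-learn, with made-up random data: because every row has exactly one 1 among the indicator columns, those columns add up to a column of all ones, so adding an explicit intercept column does not change the fitted predictions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 12
x = rng.normal(size=n)       # a made-up numeric input
group = np.arange(n) % 3     # three made-up categories
onehot = np.eye(3)[group]    # exactly one 1 per row
y = rng.normal(size=n)       # a made-up output

X = np.column_stack([x, onehot])          # no intercept column
X_int = np.column_stack([np.ones(n), X])  # explicit intercept column added

coef, *_ = np.linalg.lstsq(X, y, rcond=None)
coef_int, *_ = np.linalg.lstsq(X_int, y, rcond=None)

# The predictions agree, and the rank does not increase.
same = np.allclose(X @ coef, X_int @ coef_int)
print(same, np.linalg.matrix_rank(X), np.linalg.matrix_rank(X_int))
```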

Question 8

  • Check the values of the fitted coefficients using pd.Series(???, index=cols).

  • Is the “PRCP” value positive or negative? Does this make sense?

  • Which day do the most people bike? Which day do the fewest?

  • Do people tend to bike more or less as the months change from January to April?

  • Why might the results get less accurate with respect to the month coefficient if our data contained all months from January to December?

Question 9

To make an Altair chart c containing the actual data in the DataFrame, you can use the following code.

import altair as alt

# Clicking a point selects every point with the same "DayName" value.
sel = alt.selection_single(fields=["DayName"])

c = alt.Chart(df).mark_circle().encode(
    x="Date",
    y="Bikes",
    tooltip=["Bikes", "Date", "DayName", "PRCP"],
    size=alt.condition(sel, alt.value(40), alt.value(10))
).add_selection(sel)

  • Define that chart c and then display it.

  • Try clicking on one of the points. What change happens? (Why are some points selected but not others?)

  • What 3 parts of this Altair code are necessary for this interactivity to work?

Question 10

  • Define a second chart c1 which is a line chart instead of a scatter plot; which uses color="red" as a mark_line() argument; which again uses “Date” for the x-axis; and which uses the predicted value instead of the actual value for the y-coordinate.

  • Display a layered chart of c and c1 using c+c1.

Question 11

  • The red curve represents a linear function, but it certainly doesn’t look linear. Why isn’t that a contradiction?

  • There are frequent lower dips in the red curve. What do those local minima represent? (In other words, why do you think our function has learned to have local minima in those spots?)

Submission

To submit this homework, use the Share option at the top right to create a link to the project, then submit that link on Canvas.
