Homework 5
Author: BLANK
Collaborators: BLANK
The main part of this week’s homework is based on an example from Jake VanderPlas’s book, Python Data Science Handbook.
Part 0 - Downloading the data
Go to https://data.seattle.gov/ and find the “Fremont Bridge Bicycle Counter” dataset (not the one called “Timeline”, the plain one). Download the csv file for that dataset (click the “Export” button at the top right), and upload that csv file to this Deepnote project. Rename the csv file to “Fremont.csv”. (You can click the three dots to the right of the file name, to reach the option to rename it. Or just rename it on your computer before you upload the file.)
Part 1 - Worksheet on clustering (Optional)
This Part 1 is optional and worth up to 4 bonus points in the Homework category.
Complete either the Thursday or Friday worksheet (your choice) from Week 5 and upload it into this same project. (So in your final submission, there should be 4 files in this project: Fremont.csv, SeattleWeather.csv, Homework5.ipynb, and one of the two completed worksheets.)
Part 2 - Linear regression with Seattle bicycle data
Question 1
Read in the Fremont.csv file from Part 0, drop the rows which contain missing values, keep only the first two columns, and name the resulting DataFrame df_pre.
Rename the “Fremont Bridge Total” column to “Bikes”.
Convert the “Date” column to a datetime data type by using pd.to_datetime. (The exact dtype will be displayed as datetime64[ns]. This step may take about 30 seconds.)
Using the dt accessor and two Boolean Series, define a new pandas DataFrame df_pre2 which contains only the rows in df_pre from the year 2022 and from the 8:00am hour. Use .copy() to ensure that df_pre2 is a new DataFrame.
Round the “Date” column to the nearest date (i.e., lose the 8:00am part) by using the following code:
df_pre2["Date"] = df_pre["Date"].dt.round("d")
Check your answer: df_pre2 should have 120 rows and 2 columns.
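As a rough illustration of the filtering pattern in this question, the following sketch applies the dt accessor and two Boolean Series to a small made-up DataFrame. The names toy and toy2, the “When” and “Count” columns, and the chosen year and hour are all assumptions for illustration; this is not the Fremont data.

```python
import pandas as pd

# Made-up timestamps (an assumption for illustration, not the Fremont data).
toy = pd.DataFrame({
    "When": pd.to_datetime([
        "2020-06-01 17:00", "2021-06-01 08:00",
        "2021-06-01 17:00", "2021-06-02 17:00",
    ]),
    "Count": [3, 5, 8, 13],
})

# Two Boolean Series built with the dt accessor.
in_2021 = toy["When"].dt.year == 2021
at_5pm = toy["When"].dt.hour == 17

# Combine them with & and copy, so toy2 is independent of toy.
toy2 = toy[in_2021 & at_5pm].copy()

# Flooring to the day drops the time-of-day part.
toy2["When"] = toy2["When"].dt.floor("D")
print(toy2)
```

This sketch floors rather than rounds because 17:00 is past noon, so rounding to the nearest day would move it to the following date. At 8:00am the two operations agree.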
Question 2
The weather data in the SeattleWeather.csv file was downloaded a few days ago from this website. (You don’t need to re-download it; just use the provided csv file in this Deepnote project.) The “PRCP” column in this csv file indicates the amount of precipitation that fell on that day.
Read in the contents of the SeattleWeather.csv file, drop the rows with missing data, and name the result df_weather.
Rename the “DATE” column to “Date”.
Convert the “Date” column to a datetime data type.
Keep only the “Date” and “PRCP” columns in df_weather, for example, by using df_weather = df_weather[???].copy().
Check your answer: the resulting DataFrame should have 116 rows and 2 columns, and the dtypes should be datetime64[ns] and float64.
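The same sequence of cleaning steps can be sketched on invented data. Everything here (the csv_text string, the “DAY”, “STATION”, and “TEMP” columns, and the name toy) is a hypothetical stand-in, not the contents of SeattleWeather.csv.

```python
import io
import pandas as pd

# Stand-in for a csv file on disk; the columns are invented for this sketch.
csv_text = """DAY,STATION,TEMP
2022-01-01,A,41.0
2022-01-02,A,
2022-01-03,A,39.5
"""

toy = pd.read_csv(io.StringIO(csv_text))
toy = toy.dropna()                          # drop rows with any missing value
toy = toy.rename({"DAY": "Day"}, axis=1)    # rename a single column
toy["Day"] = pd.to_datetime(toy["Day"])     # strings -> datetime
toy = toy[["Day", "TEMP"]].copy()           # keep two columns as a new DataFrame
print(toy.dtypes)
```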
Question 3
Using merge with type “inner”, merge together df_pre2 and df_weather on their “Date” columns. (See this week’s videos for information on merge.) Name the resulting DataFrame df.
The resulting DataFrame df should have 116 rows and 3 columns.
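To see why an inner merge can return fewer rows than one of its inputs, here is a minimal sketch on two made-up DataFrames. The names left, right, and both, and the “Day”, “Count”, and “Rain” columns, are assumptions for illustration only.

```python
import pandas as pd

left = pd.DataFrame({
    "Day": pd.to_datetime(["2022-01-01", "2022-01-02", "2022-01-03"]),
    "Count": [10, 20, 30],
})
right = pd.DataFrame({
    "Day": pd.to_datetime(["2022-01-02", "2022-01-03", "2022-01-04"]),
    "Rain": [0.0, 0.4, 0.1],
})

# An inner merge keeps only the keys present in both DataFrames.
both = left.merge(right, on="Day", how="inner")
print(both)
```

Only the two days that appear in both inputs survive, and the result has one shared key column plus one column from each side.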
Question 4
Midterm 1 Review: pd.to_datetime("2022-05-02").day_name() is “Monday”. Using this idea, list comprehension, and f-strings, make the length 7 list ["Monday", "Tuesday", ..., "Saturday", "Sunday"] and save it with the variable name days.
Add 7 columns to df with the names “Monday”, …, “Sunday”, and fill in all the values with 0. (This is easier than it sounds. Just use df[days] = 0.)
df should now have 10 columns.
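An analogous pattern, sketched with month names instead of day names, may help here. The names months and toy and the column “x” are hypothetical, and the year 2022 in the f-string is arbitrary.

```python
import pandas as pd

# Month names via a list comprehension and an f-string (2022 is arbitrary).
months = [pd.to_datetime(f"2022-{i:02d}-01").month_name() for i in range(1, 13)]

toy = pd.DataFrame({"x": [1, 2, 3]})
# Assigning a scalar to a list of new column names creates them all at once.
toy[months[:3]] = 0
print(toy.columns.tolist())
```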
Question 5
Add a new column called “DayName” to df that contains the day name for that row. For example, January 1st of this year was a Saturday, so the initial value in the “DayName” column should be “Saturday”. (Suggestion: use df["DayName"] = df["Date"].map(???).)
Add a new column called “Month” to df that contains the numerical month number. Use the dt accessor again.
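The two approaches mentioned in this question, map with a function and the dt accessor, can be compared on a small made-up column. The DataFrame toy and the new column names “MonthName” and “Day” are assumptions for this sketch.

```python
import pandas as pd

toy = pd.DataFrame({
    "Date": pd.to_datetime(["2022-01-01", "2022-02-14", "2022-03-17"]),
})

# map calls the function once per Timestamp in the column.
toy["MonthName"] = toy["Date"].map(lambda ts: ts.month_name())

# The dt accessor exposes datetime attributes for the whole column at once.
toy["Day"] = toy["Date"].dt.day
print(toy)
```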
Question 6
If we want to know which rows correspond to “Monday”, we can use df["DayName"] == "Monday".
If we want to set the value in the “Monday” column of those rows to 1, we can use df.loc[df["DayName"] == "Monday", "Monday"] = 1.
Using that idea, a for loop, and the days list from above, set the “Monday” column of all “Monday” rows to 1, set the “Tuesday” column of all “Tuesday” rows to 1, etc.
Check your answer. If you evaluate df[days].sum(axis=0), you should see that there are 17 Mondays and 16 Thursdays. (One Sunday value seems to be missing.)
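The loc assignment described above can be sketched with a different categorical column. The DataFrame toy, its “Color” column, and the list colors are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({"Color": ["red", "blue", "red", "green"]})
colors = ["red", "green", "blue"]

# Start with all-zero indicator columns, one per category.
toy[colors] = 0
for c in colors:
    # The Boolean Series selects the rows; the label c selects the column.
    toy.loc[toy["Color"] == c, c] = 1

print(toy[colors].sum(axis=0))
```

Each row ends up with exactly one 1 among the indicator columns. pandas also provides pd.get_dummies for building indicator columns of this kind directly.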
Question 7
Create a scikit-learn LinearRegression object reg. When you create reg, specify the keyword argument fit_intercept=False. (For this particular data, allowing an intercept value gives no extra flexibility.)
Define a length 9 list cols containing “PRCP”, “Month”, and all the days from the days list.
Fit reg using cols for the input variables, and using “Bikes” for the output variable.
Add a “Pred” column to df containing the reg.predict values.
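One way to check the remark about the intercept is to fit two models on made-up data whose indicator columns sum to 1 in every row. This is only a sketch under that assumption; the DataFrame toy, the list inputs, and the names reg_no and reg_yes are hypothetical and unrelated to the bicycle data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up data: one numeric input plus indicator columns for two groups.
toy = pd.DataFrame({
    "x": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0],
    "A": [1, 0, 1, 0, 1, 0],
    "B": [0, 1, 0, 1, 0, 1],
    "y": [5.1, 2.9, 7.2, 5.0, 8.8, 7.1],
})
inputs = ["x", "A", "B"]

reg_no = LinearRegression(fit_intercept=False).fit(toy[inputs], toy["y"])
reg_yes = LinearRegression(fit_intercept=True).fit(toy[inputs], toy["y"])

# A + B equals 1 in every row, so a constant term is already expressible
# through the indicator columns; compare the two sets of predictions.
same = np.allclose(reg_no.predict(toy[inputs]), reg_yes.predict(toy[inputs]))
print(same)
```

Because the constant column is a sum of the indicator columns, adding an intercept does not enlarge the set of functions the model can represent, so the fitted predictions coincide.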
Question 8
Check the values of the fitted coefficients using pd.Series(???, index=cols).
Is the “PRCP” value positive or negative? Does this make sense?
Which day do the most people bike? Which day do the fewest?
Do people tend to bike more or less as the months change from January to April?
Why might the results get less accurate with respect to the month coefficient if our data contained all months from January to December?
Question 9
To make an Altair chart c containing the actual data in the DataFrame, you can use the following code.
sel = alt.selection_single(fields=["DayName"])
c = alt.Chart(df).mark_circle().encode(
    x="Date",
    y="Bikes",
    tooltip=["Bikes", "Date", "DayName", "PRCP"],
    size=alt.condition(sel, alt.value(40), alt.value(10))
).add_selection(sel)
Define that chart c and then display it.
Try clicking on one of the points. What change happens? (Why are some points selected but not others?)
What 3 parts of this Altair code are necessary for this interactivity to work?
Question 10
Define a second chart c1 which is a line chart instead of a scatter plot; which uses color="red" as a mark_line() argument; which again uses “Date” for the x-axis; and which uses the predicted value instead of the actual value for the y-coordinate.
Display a layered chart of c and c1 using c+c1.
Question 11
The red curve represents a linear function, but it certainly doesn’t look linear. Why isn’t that a contradiction?
There are frequent lower dips in the red curve. What do those local minima represent? (In other words, why do you think our function has learned to have local minima in those spots?)
Submission
To submit this homework, go to the Share option at the top right, share the project to create a link, and then submit that link on Canvas.