Worksheet 11
Contents
Worksheet 11#
Authors (3 maximum; use your full names): BLANK
This worksheet is due Tuesday of Week 8, before discussion section. You are encouraged to work in groups of up to 3 total students, but each student should submit their own file. (It’s fine for everyone in the group to upload the same file.)
Recommendation. Follow the Worksheet 0 instructions to form a free Pro workspace for you and your groupmates, so you can all work on the same file.
The main part of this week’s homework is based on an example from Jake VanderPlas’s book, Python Data Science Handbook.
Downloading the data#
Go to https://data.seattle.gov/ and find the “Fremont Bridge Bicycle Counter” dataset (not the one called “Timeline”, the plain one). Download the csv file for that dataset (click the “Export” button at the top right), and upload that csv file to this Deepnote project. Rename the csv file to “Fremont.csv”. (You can click the three dots to the right of the file name, to reach the option to rename it. Or just rename it on your computer before you upload the file.)
Linear regression with Seattle bicycle data#
Question 1#
Read in the Fremont.csv file from Part 0, drop the rows which contain missing values, keep only the first two columns, and name the resulting DataFrame
df_pre
.Rename the “Fremont Bridge Total” column to “Bikes”, using a command of the form
df_pre.rename({???: ???}, axis=???, inplace=???)
Convert the “Date” column to a
datetime
data type by usingpd.to_datetime
.Using the
dt
accessor and several Boolean Series, define a new pandas DataFramedf_pre2
which contains only the rows indf_pre
from the year 2022, from September or earlier, and from the hour 8:00am in the morning. Use.copy()
to ensure thatdf_pre2
is a new DataFrame.Round the “Date” column to the nearest date (i.e., lose the 8:00am part) by using
dt
,round
(documentation), and the nearest calendar day offset. (Updated 11/14: I had previously said you could usedt.date
but that fails when we reach the Altair step. Instead usedt.round
.)
Question 2#
The weather data in the attached SeattleWeather.csv file was downloaded a few days ago from this website. (You don’t need to re-download it; just use the provided csv file in this Deepnote project.) The “PRCP” column in this csv file indicates the amount of precipitation that fell on that day.
Read in the contents of the SeattleWeather.csv file, drop the rows with missing data, and name the result
df_weather
.Rename the “DATE” column to “Date”.
Convert the “Date” column to a
datetime
data type.Keep only the “Date”, “PRCP”, and “TMIN” columns in
df_weather
, for example, by usingdf_weather = df_weather[???].copy()
. (The column “PRCP” stands for how much precipitation there was, presumably in inches, and “TMIN” stands for the minimum temperature for that day, in Fahrenheit.)Check your answer: the resulting DataFrame should have 304 rows and 3 columns, and the
dtypes
should be one ofdatetime64[ns]
and two offloat64
.
Question 3#
Use the pandas DataFrame method
merge
(documentation) withhow="inner"
to merge togetherdf_pre2
anddf_weather
on their “Date” columns. Name the resulting DataFramedf
.The resulting DataFrame
df
should have 273 rows and 4 columns. (Updated 11/9: It used to say 5 columns.)
Question 4#
Add a new column named “DayName” to
df
corresponding to the day of the week (like “Monday”) and a new column called “Month” todf
that contains the numerical month number (like11
). Use thedt
accessor again.Apply scikit-learn’s
OneHotEncoder
class to this “DayName” column. Name the objectencoder
.Convert the resulting object to a NumPy array using the
toarray
method. Name the resultarr
.The
encoder.get_feature_names_out()
array will contain strings like'DayName_Friday'
,'DayName_Monday'
. Using list comprehension, make the corresponding list'Friday'
,'Monday'
, … Make sure your list appears in the same order asencoder.get_feature_names_out()
. (One possibility is to use the string methodstrip
. Another possibility is to firstsplit
the string at"_"
…) Call the resulting listday_list
.Add 7 columns to
df
with the names fromday_list
and put the contents ofarr
into those columns. (This can be done in a short line of code:df[???] = ???
.)df
should now have 13 columns. (Updated 11/9: Used to say 10 columns incorrectly.)
Question 5#
Instantiate a scikit-learn
LinearRegression
objectreg
. When you createreg
, Use the keyword argumentfit_intercept
with an appropriate value to specify that we do not want to find an intercept in this case. (For this particular data, allowing an intercept value gives no extra flexibility.)Define a list
cols
containing all the numeric columns fromdf
except the “Bikes” column. Useis_numeric_dtype
frompandas.api.types
together with list comprehension anddf.columns
. You can either make a list of all the numeric columns and then remove “Bikes” using theremove
method, or you can already exclude “Bikes” inside the list comprehension (for example, by using a second condition in theif
clause).Fit
reg
usingcols
for the input variables, and using “Bikes” for the output variable.Add a “Pred” column to
df
containing thereg.predict
values.
Question 6#
Check the values of the fitted coefficients using
pd.Series(???, index=cols)
.
Answer the following questions in a markdown cell. (Not as a comment, but as a markdown cell. You should not need to use the Python comment symbol #
. In the course project, you will need to make many of these explanatory markdown cells.)
Is the “PRCP” value positive or negative? Does this make sense?
Is the “TMIN” value positive or negative? Does this make sense?
Do the results suggest that people are biking more for exercise or more to get to work/school? Briefly explain.
I expect that if we try the same thing later using also the October, November and December data, that the “Month” coefficient will decrease. Why do I expect that?
Question 7#
Make a small pandas DataFrame with only one row, with columns equal to
cols
, and initially filled with all zeros. Assign it to the variable namedf_test
. (One approach is tocopy
a one-row sub-DataFrame fromdf
with the appropriate columns and then set all values to0
. Be sure to usecopy
so that you don’t change any values indf
. Another approach is to usepd.DataFrame([???], columns=cols)
, where ??? is a list of zeros of the appropriate length. Another approach is to usepd.DataFrame([{???:??? for ??? in cols}])
, where the portion{???:??? for ??? in cols}
is what’s known as a dictionary comprehension; dictionary comprehensions are completely analogous to list comprehensions.)Replace some of the zeros with sample values, making a data point for which our model predicts a negative number of bike riders on December 1st, 2022. (Don’t look up what day of the week that is; use pandas to determine it.)
Instead of making a one-row pandas DataFrame, why didn’t we just use a pandas Series? (Hint. What does scikit-learn expect from the first input to
reg.fit
or from the input toreg.predict
?)
Question 8#
Replace the ??? terms so that the following produces the length-7 list
["Monday", "Tuesday", ..., "Sunday"]
.
ordered_days = [pd.to_datetime(???"11/???/2022").day_name() for ??? in range(7,???)]
To make an Altair chart
c
containing the actual data in the DataFrame, you can use the following code.
sel = alt.selection_single(fields=["DayName"], empty="none")
base = alt.Chart(df).mark_circle().encode(
x="Date",
y="Bikes",
tooltip=["Bikes", "Date", "DayName", "PRCP", "TMIN"],
size=alt.condition(sel, alt.value(60),alt.value(20)),
color=alt.Color("DayName", sort=ordered_days)
).add_selection(sel)
text = alt.Chart(df).mark_text(y=20, size=20).encode(
text="DayName",
opacity=alt.condition(sel, alt.value(1), alt.value(0))
)
c = base+text
Define that chart
c
and then display it.Try clicking on one of the points. What change happens? (Why are some points selected but not others?)
What line of this code guarantees that only the selected day name will be displayed (like “Monday”)?
Change the base chart so that selected points have an opacity value of
1
and the unselected points have an opacity value of0.5
.
Question 9#
Define a second chart
c1
which is a line chart instead of a scatter plot; which usescolor="red"
as amark_line()
keyword argument; which again uses “Date” for the x-axis; and which uses the predicted value instead of the actual value for the y-coordinate.Add a
transform_filter
toc1
, and filter using the selection object created for thec
chart above.Display a layered chart of
c
andc1
usingc+c1
. (If you try to just displayc1
by itself, it won’t work, because the selection object was never added toc1
.)If you try clicking on a point in the chart, you should now see all the points from the same day of the week highlighted, and you should also see a red predicted curve (the “line of best fit”) for that day of the week.
Question 10#
In the predicted curve for Monday, there is an extreme dip near the beginning of March. What is the cause for this dip? (Use the tooltip for the corresponding data point.)
In the actual data points for Monday, there are some very low values (especially towards the right side of the chart) that aren’t matched in the predicted curve. What do you think is the cause for these low values? (Hint. American students might have an unfair advantage on this question.)
The red curve represents a linear function, but it certainly doesn’t look linear. Why isn’t that a contradiction?
Why do the bounds change for the y-axis if you click on a Sunday point?
Reminder#
Every group member needs to submit this on Canvas (even if you all submit the same link).
Be sure you’ve included the (full) names of you and your group members at the top after “Authors”. You might consider restarting this notebook (the Deepnote refresh icon, not your browser refresh icon) and running it again to make sure it’s working.
Submission#
Using the Share button at the top right, enable Comment access level for anyone with the link. Then submit that link on Canvas.