Worksheet 2
Worksheet 2#
This worksheet is due Tuesday of Week 2, before discussion section. You are encouraged to work in groups of up to 3 total students, but each student should submit their own file. (It’s fine for everyone in the group to upload the same file.)
Recommendation. Follow the Worksheet 0 instructions to form a free Pro workspace for you and your groupmates, so you can all work on the same file.
These questions refer to the attached vending machines csv file, vend.csv.
What are the names of your group members? Answer as a list of strings. (Even if you work as a group of 1, still answer as a length-1 list. You won’t get full points if your answer is a string instead of a list.) Save this list with the variable name
names.
We saw on Tuesday several ways to determine how many transaction dates occurred on Saturday, but we didn’t see this using the
findmethod. For what value ofzdoes the following code produce the same answer?(df["TransDate"].str.find("Saturday") != z).sum()(Hint. Break this code up into pieces.)
Preliminary step. Define a pandas DataFrame
df_juneto be the sub-DataFrame of the full dataset in which the month for the transaction date is June. (Hints. Break this up into steps. Use the pandasto_datetimefunction and thedtaccessor attribute to find which transaction dates correspond to June. Then use Boolean indexing to get the corresponding sub-DataFrame.)
What machine (not to be confused with location) had the most transactions in June? Save the answer as a string with the variable name
june_mach.
What proportion of June transactions is that machine responsible for? Save the answer as a float with the variable name
june_prop.
What proportion of overall transactions is that machine responsible for? Save the answer as a float with the variable name
over_prop.
How many transactions were there for which the product name contains the word “Water”? Don’t assume the product name starts or ends with “Water”. (Hint. Even though Python strings do not have a
containsmethod, ifmyseriesis a pandas Series of strings, thenmyseries.strdoes have acontainsmethod.) Save your answer as an integer with the variable namewater.
Sort the dataset according to the “Prcd Date” column using the
sort_valuesmethod. Give an example of two indicesi,jwheredf.loc[i, "Prcd Date"]comes beforedf.loc[j, "Prcd Date"]in this sorted Series, but where the actual datedf.loc[i, "Prcd Date"]comes after the datedf.loc[j, "Prcd Date"]. (Do not usepd.to_datetime… the point is that the sorting is not correct when we sort as a Series of strings as opposed to a Series of timestamps.) Save your answer as a tuplewrong_sort = (i,j), whereiandjare integers. You’re expected to solve this by just browsing the sorted Series. You can try to find these values more systematically, but that’s definitely more difficult, and I don’t think it’s possible with what we’ve learned so far in Math 10.
For how many rows in the dataset is the “TransDate” different from the “Prcd Date”? (Use the
dtaccessor attribute.) Save your answer as a variable nameddiff_dates. (Hint. If your answer is over 6000, something went wrong and you are probably getting that every date is different.)
Get a link to share this notebook with comment access like on Worksheet 0. (Click the Share & publish link at the top right; enable public sharing; change the sharing permissions from View to Comment.) Save this link as a string with the variable name
link.
Create a tuple
(names, z, june_mach, june_prop, over_prop, water, wrong_sort, diff_dates, link)and save this tuple as a pickle file (see the Worksheet 1 instructions). Submit that pickle file as your Worksheet 2 submission on Canvas.