Worksheet 2#

This worksheet is due Tuesday of Week 2, before discussion section. You are encouraged to work in groups of up to 3 total students, but each student should submit their own file. (It’s fine for everyone in the group to upload the same file.)

Recommendation. Follow the Worksheet 0 instructions to form a free Pro workspace for you and your groupmates, so you can all work on the same file.

These questions refer to the attached vending machines csv file, vend.csv.

  • What are the names of your group members? Answer as a list of strings. (Even if you work as a group of 1, still answer as a length-1 list. You won’t get full points if your answer is a string instead of a list.) Save this list with the variable name names.

  • We saw on Tuesday several ways to determine how many transaction dates occurred on Saturday, but we didn’t see this using the find method. For what value of z does the following code produce the same answer? (df["TransDate"].str.find("Saturday") != z).sum() (Hint. Break this code up into pieces.)

  • Preliminary step. Define a pandas DataFrame df_june to be the sub-DataFrame of the full dataset in which the month for the transaction date is June. (Hints. Break this up into steps. Use the pandas to_datetime function and the dt accessor attribute to find which transaction dates correspond to June. Then use Boolean indexing to get the corresponding sub-DataFrame.)

  • What machine (not to be confused with location) had the most transactions in June? Save the answer as a string with the variable name june_mach.

  • What proportion of June transactions is that machine responsible for? Save the answer as a float with the variable name june_prop.

  • What proportion of overall transactions is that machine responsible for? Save the answer as a float with the variable name over_prop.

  • How many transactions were there for which the product name contains the word “Water”? Don’t assume the product name starts or ends with “Water”. (Hint. Even though Python strings do not have a contains method, if myseries is a pandas Series of strings, then myseries.str does have a contains method.) Save your answer as an integer with the variable name water.

  • Sort the dataset according to the “Prcd Date” column using the sort_values method. Give an example of two indices i,j where df.loc[i, "Prcd Date"] comes before df.loc[j, "Prcd Date"] in this sorted Series, but where the actual date df.loc[i, "Prcd Date"] comes after the date df.loc[j, "Prcd Date"]. (Do not use pd.to_datetime… the point is that the sorting is not correct when we sort as a Series of strings as opposed to a Series of timestamps.) Save your answer as a tuple wrong_sort = (i,j), where i and j are integers. You’re expected to solve this by just browsing the sorted Series. You can try to find these values more systematically, but that’s definitely more difficult, and I don’t think it’s possible with what we’ve learned so far in Math 10.

  • For how many rows in the dataset is the “TransDate” different from the “Prcd Date”? (Use the dt accessor attribute.) Save your answer as a variable named diff_dates. (Hint. If your answer is over 6000, something went wrong and you are probably getting that every date is different.)

  • Get a link to share this notebook with comment access like on Worksheet 0. (Click the Share & publish link at the top right; enable public sharing; change the sharing permissions from View to Comment.) Save this link as a string with the variable name link.

  • Create a tuple (names, z, june_mach, june_prop, over_prop, water, wrong_sort, diff_dates, link) and save this tuple as a pickle file (see the Worksheet 1 instructions). Submit that pickle file as your Worksheet 2 submission on Canvas.

Created in deepnote.com Created in Deepnote