Worksheet 5

This worksheet is due Monday night of the following week. You are encouraged to work in groups of up to 3 total students, but each student should submit their own file. (It’s fine for everyone in the group to upload the same file.)

These questions refer to the attached csv file, data_groups.csv. This is a very nice (artificial) dataset for demonstrating the usefulness of data visualization.

Put the full names of everyone in your group (even if you’re working alone) here. (This makes grading easier.)

Names:

Part 1: Without data visualization#

Load the data using pandas as a DataFrame stored as df and look at the first few rows.
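As a reminder of the general pattern, here is a minimal sketch of reading csv data with pandas. The csv text below is a made-up stand-in (not the real contents of data_groups.csv); only the column names "grouping", "x", and "y" are taken from the code later in this worksheet, and their order is an assumption.

```python
import io

import pandas as pd

# Hypothetical stand-in for a csv file: made-up values, assumed column order.
csv_text = """grouping,x,y
a,1.0,2.0
a,3.0,4.0
b,5.0,6.0
b,7.0,8.0
"""

# pd.read_csv accepts a file path or a file-like object. With a real file,
# the argument would be the path to that file as a string.
df_example = pd.read_csv(io.StringIO(csv_text))

# head() returns the first 5 rows by default (this toy frame has only 4).
first_rows = df_example.head()
print(first_rows)
```

The same two calls, `pd.read_csv` and `head`, apply when the argument is a file path instead of an in-memory string.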

Look over the following code (and its output). (You don’t need to write anything here, but for a quiz or midterm, you should be able to understand this kind of code.)

groups = sorted(df["grouping"].unique())

for gp in groups:
    df_sub = df[df["grouping"] == gp]
    print("The group is", gp) # if you know f-strings, those could be used instead
    print("The mean of x is", df_sub["x"].mean(), 
            "The standard deviation of x is", df_sub["x"].std())
    print("The mean of y is", df_sub["y"].mean(), 
            "The standard deviation of y is", df_sub["y"].std())
    print()
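
The comment in the loop above mentions f-strings. As a small sketch of what that could look like, here is the same kind of loop on a made-up toy DataFrame (the name df_toy and its values are hypothetical, chosen only for illustration).

```python
import pandas as pd

# Made-up toy data with the same column names as the worksheet's DataFrame.
df_toy = pd.DataFrame({
    "grouping": ["a", "a", "b", "b"],
    "x": [1.0, 3.0, 10.0, 20.0],
    "y": [2.0, 4.0, 5.0, 5.0],
})

lines = []
for gp in sorted(df_toy["grouping"].unique()):
    df_sub = df_toy[df_toy["grouping"] == gp]
    # Inside an f-string, {expr:.2f} formats the value with two decimal places.
    line = (f"Group {gp}: x mean {df_sub['x'].mean():.2f}, "
            f"x std {df_sub['x'].std():.2f}")
    lines.append(line)
    print(line)
```

Note the single quotes around the column name inside the f-string, which avoid clashing with the double quotes that delimit the string.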

Here is another way to get the same mean information. Compare this code and its output with the output above, and see how they’re related. (Again, you don’t need to write anything for this part.)

df.groupby("grouping").mean()
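
To make the relationship concrete, here is a sketch on a made-up toy DataFrame (df_toy and its values are hypothetical). It checks that each entry of the groupby table equals the mean computed on the corresponding sub-DataFrame.

```python
import pandas as pd

# Made-up toy data with the same column names as the worksheet's DataFrame.
df_toy = pd.DataFrame({
    "grouping": ["a", "a", "b", "b"],
    "x": [1.0, 3.0, 10.0, 20.0],
    "y": [2.0, 4.0, 5.0, 5.0],
})

# The result has one row per group and one column per remaining column.
means = df_toy.groupby("grouping").mean()

# Compare every table entry with the mean of the matching sub-DataFrame.
for gp in means.index:
    df_sub = df_toy[df_toy["grouping"] == gp]
    for col in ["x", "y"]:
        assert abs(means.loc[gp, col] - df_sub[col].mean()) < 1e-12

print(means)
```

The group labels become the index of the result, which is why `means.loc[gp, col]` looks up a single group's mean.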

How do you think you get the same standard deviation information using groupby? Try it, and check that the information does match what we got using the for loop.

In a markdown cell, describe the main takeaway from the above outputs. What do they tell you about the dataset? Feel free to make more computations if they are helpful.

Part 2: Visualizing the data all at once#

Using Altair, plot the data in df using a scatter plot (mark_circle). Encode the “x” column in the x-axis, the “y” column in the y-axis, and the “grouping” in the color. Use :N after specifying “grouping” to tell Altair that these groupings are categories, not quantitative values.

(The resulting plot should look like a mess.)

Part 3: Visualizing the 13 groupings individually#

We will see that these 13 groups of data are fundamentally different by plotting them separately. We don’t have to do this by hand. Altair, Seaborn, and Plotly Express all have their own way to do this separate plotting automatically.

Using Altair#

Make a facet chart in Altair by taking the same Chart definition as above and including row="grouping:N" within the encoding. This tells Altair to put each grouping in its own row. Reference 1 (skip the first for-loop example and scroll below it). Reference 2

Using Seaborn#

Make the same facet chart, but using Seaborn. Reference 1 (Scroll down to the example… tips is the name of their DataFrame, so you should use df instead of tips. It seems like the hue keyword argument should be used with sns.FacetGrid, while the x and y arguments should be used with map). Reference 2

Using Plotly Express#

Make the same facet chart using Plotly Express. Reference (To get it to look similar to the Altair and Seaborn versions, I added the keyword argument height=4000 to the px.scatter function.)

Submission#

Using the Share button at the top right, enable public sharing, and enable Comment privileges. Then submit the created link on Canvas.