Week 3 Wednesday

Announcements

  • I’m trying to use a “paid” machine. I’m not sure what happens when you duplicate this project. You might have to go to Environment on the right-hand side, click the Settings wheel, and switch to the Basic machine.

  • Deepnote support thinks my “hardware resetting” issue should be fixed. Let’s try and see.

  • Reminder: I have office hours in here Friday before class, 12-1pm.

  • Worksheet 6 posted today.

  • Maya is here to help.

Plan

  • Encoding data types

  • Other types of charts in Altair

  • Multi-view plots in Altair

  • Install the newest version of Altair by executing the following command. (Even though we are executing it in a code cell, this isn’t a Python command; the leading ! passes it to the underlying Linux operating system.)

Let me know if this process significantly slows down your workspace!

!pip install altair==5.0.0rc1
Collecting altair==5.0.0rc1
  Downloading altair-5.0.0rc1-py3-none-any.whl (709 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 709.5/709.5 KB 64.4 MB/s eta 0:00:00
Requirement already satisfied: numpy in /shared-libs/python3.9/py/lib/python3.9/site-packages (from altair==5.0.0rc1) (1.23.4)
Requirement already satisfied: jinja2 in /shared-libs/python3.9/py-core/lib/python3.9/site-packages (from altair==5.0.0rc1) (2.11.3)
Requirement already satisfied: jsonschema>=3.0 in /shared-libs/python3.9/py-core/lib/python3.9/site-packages (from altair==5.0.0rc1) (3.2.0)
Requirement already satisfied: typing-extensions>=4.0.1 in /shared-libs/python3.9/py/lib/python3.9/site-packages (from altair==5.0.0rc1) (4.4.0)
Requirement already satisfied: pandas>=0.18 in /shared-libs/python3.9/py/lib/python3.9/site-packages (from altair==5.0.0rc1) (1.2.5)
Requirement already satisfied: toolz in /shared-libs/python3.9/py/lib/python3.9/site-packages (from altair==5.0.0rc1) (0.12.0)
Requirement already satisfied: pyrsistent>=0.14.0 in /shared-libs/python3.9/py-core/lib/python3.9/site-packages (from jsonschema>=3.0->altair==5.0.0rc1) (0.18.1)
Requirement already satisfied: attrs>=17.4.0 in /shared-libs/python3.9/py-core/lib/python3.9/site-packages (from jsonschema>=3.0->altair==5.0.0rc1) (22.1.0)
Requirement already satisfied: six>=1.11.0 in /shared-libs/python3.9/py-core/lib/python3.9/site-packages (from jsonschema>=3.0->altair==5.0.0rc1) (1.16.0)
Requirement already satisfied: setuptools in /root/venv/lib/python3.9/site-packages (from jsonschema>=3.0->altair==5.0.0rc1) (58.1.0)
Requirement already satisfied: pytz>=2017.3 in /shared-libs/python3.9/py/lib/python3.9/site-packages (from pandas>=0.18->altair==5.0.0rc1) (2022.5)
Requirement already satisfied: python-dateutil>=2.7.3 in /shared-libs/python3.9/py-core/lib/python3.9/site-packages (from pandas>=0.18->altair==5.0.0rc1) (2.8.2)
Requirement already satisfied: MarkupSafe>=0.23 in /shared-libs/python3.9/py-core/lib/python3.9/site-packages (from jinja2->altair==5.0.0rc1) (2.0.0)
Installing collected packages: altair
  Attempting uninstall: altair
    Found existing installation: altair 4.2.2
    Not uninstalling altair at /shared-libs/python3.9/py/lib/python3.9/site-packages, outside environment /root/venv
    Can't uninstall 'altair'. No files were found to uninstall.
Successfully installed altair-5.0.0rc1
WARNING: You are using pip version 22.0.4; however, version 23.1 is available.
You should consider upgrading via the '/root/venv/bin/python -m pip install --upgrade pip' command.

  • Import Altair (how is that different from what we just did?) and check that the version is indeed 5.0.0rc1.

import altair as alt
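One way to see the difference between installing and importing: pip puts the package files on disk, while import loads them into the running Python session. The sketch below asks which version is installed without importing the package at all, using the standard library's importlib.metadata. The helper name installed_version is made up for this illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Reports what is installed on disk, even before any `import altair` has run.
print(installed_version("altair"))
```

After running import altair as alt, the same information is available as alt.__version__.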

Encoding data types

(This notion of quantitative data vs categorical data will also be very important when we get to the Machine Learning portion of Math 10.) Altair chooses different default values depending on the type of the data being encoded. These are the 5 types of data distinguished by Altair (Reference):

Data Type      Shorthand Code   Description
quantitative   Q                a continuous real-valued quantity
ordinal        O                a discrete ordered quantity
nominal        N                a discrete unordered category
temporal       T                a time or date value
geojson        G                a geographic shape

A quantitative data type is just an ordinary numeric data type, like floats. Ordinal and Nominal data types are categorical data types, where the values represent discrete categories or classes. We use the Ordinal designation if the categories have a natural ordering and we use Nominal if the categories do not have a natural ordering. A Temporal data type is used for data representing datetime-like values. The last encoding data type (which I haven’t covered in Math 10 before) is for geographic values (like for maps).
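To make the distinction concrete, here is a rough sketch of how one might guess a shorthand code from a pandas column. This is not Altair's own inference code; the helper guess_encoding_type and the "sold" and "size" columns are invented for the illustration, and the geojson type is left out.

```python
import pandas as pd
from pandas.api import types as pdt


def guess_encoding_type(series):
    """Rough guess at an Altair shorthand code (Q, O, N, T) for a pandas Series."""
    if pdt.is_datetime64_any_dtype(series):
        return "T"  # temporal
    if isinstance(series.dtype, pd.CategoricalDtype):
        # Ordered categories have a natural ordering; unordered ones do not.
        return "O" if series.dtype.ordered else "N"
    if pdt.is_bool_dtype(series):
        return "N"  # check booleans before numbers, since pandas counts bool as numeric
    if pdt.is_numeric_dtype(series):
        return "Q"
    return "N"  # strings and everything else


example = pd.DataFrame({
    "mpg": [18.0, 15.0, 28.4],
    "origin": ["usa", "usa", "usa"],
    # The next two columns are made up for this example.
    "sold": pd.to_datetime(["1970-01-01", "1970-06-01", "1979-03-01"]),
    "size": pd.Categorical(["large", "large", "small"], categories=["small", "large"], ordered=True),
})

for column in example.columns:
    print(column, guess_encoding_type(example[column]))
```

A guess like this can only look at how the data is stored. A column of integers such as "model_year" would be guessed as Q, so treating it as ordinal or nominal has to be requested explicitly with :O or :N.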

  • Load the “mpg” dataset (sns.load_dataset) from Seaborn and name the DataFrame df.

import seaborn as sns
df = sns.load_dataset("mpg")
  • Find the sub-DataFrame for which the name of the car contains the substring “skylark”. Name the sub-DataFrame df_sub. (Reminder: use str and contains.)

Notice how the row at index 1 has the substring "skylark" in the “name” column.

df.head(4)
    mpg  cylinders  displacement  horsepower  weight  acceleration  model_year  origin                       name
0  18.0          8         307.0       130.0    3504          12.0          70     usa  chevrolet chevelle malibu
1  15.0          8         350.0       165.0    3693          11.5          70     usa          buick skylark 320
2  18.0          8         318.0       150.0    3436          11.0          70     usa         plymouth satellite
3  16.0          8         304.0       150.0    3433          12.0          70     usa              amc rebel sst

We need to use the accessor attribute str, because that will give us access to methods like contains. If we try to use contains directly on this pandas Series, then we get an error, because pandas Series do not have such a method.

df["name"].contains("skylark")
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In [8], line 1
----> 1 df["name"].contains("skylark")

File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/generic.py:5465, in NDFrame.__getattr__(self, name)
   5463 if self._info_axis._can_hold_identifiers_and_holds_name(name):
   5464     return self[name]
-> 5465 return object.__getattribute__(self, name)

AttributeError: 'Series' object has no attribute 'contains'

On the other hand, if we use str, then we can use the contains method. Notice how we get a True in the position at index 1, corresponding to the substring "skylark" we saw above.

df["name"].str.contains("skylark")
0      False
1       True
2      False
3      False
4      False
       ...  
393    False
394    False
395    False
396    False
397    False
Name: name, Length: 398, dtype: bool

Here we use Boolean indexing to get the appropriate sub-DataFrame. Notice how there are only 4 rows, and each of the names contains the substring “skylark”.

df_sub = df[df["name"].str.contains("skylark")]
df_sub
      mpg  cylinders  displacement  horsepower  weight  acceleration  model_year  origin                   name
1    15.0          8         350.0       165.0    3693          11.5          70     usa      buick skylark 320
226  20.5          6         231.0       105.0    3425          16.9          77     usa          buick skylark
305  28.4          4         151.0        90.0    2670          16.0          79     usa  buick skylark limited
339  26.6          4         151.0        84.0    2635          16.4          81     usa          buick skylark
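The filtering step above can be tried out on a small hand-typed Series. The sketch below uses the four car names from df.head(4) plus one missing value added for illustration, and it shows two optional keyword arguments of str.contains. The code above worked without them, but they can matter for messier data.

```python
import pandas as pd

# The first four car names shown above, plus one missing value added for illustration.
names = pd.Series(
    ["chevrolet chevelle malibu", "buick skylark 320", "plymouth satellite", "amc rebel sst", None],
    name="name",
)

# regex=False treats "skylark" as a plain substring rather than a regular expression.
# na=False makes a missing name count as "does not contain" instead of a missing value.
mask = names.str.contains("skylark", regex=False, na=False)

# Boolean indexing keeps only the entries where the mask is True.
matches = names[mask]
print(matches)
```

Without na=False, the mask would contain a missing value in the last position, and Boolean indexing with such a mask raises an error.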
  • Make a scatter plot in Altair from this sub-DataFrame using the “model_year” for both the x-coordinate and the color, and using “mpg” for the y-coordinate. (We can increase the size of the points, and remove zero from the x-axis, to make it easier to see.)

Here is the basic chart that shows up. We will make it look better by making some changes.

alt.Chart(df_sub).mark_circle().encode(
    x="model_year",
    color="model_year",
    y="mpg"
)

First we increase the point size.

alt.Chart(df_sub).mark_circle(size=150).encode(
    x="model_year",
    color="model_year",
    y="mpg"
)

Next we specify that the x-axis is not required to include zero. This syntax has recently become shorter. If you skipped the !pip install ... step above, you will need to use the old syntax:

x=alt.X("model_year", scale=alt.Scale(zero=False))

instead of the new syntax:

x=alt.X("model_year").scale(zero=False)

# without pip install: x=alt.X("model_year", scale=alt.Scale(zero=False))
alt.Chart(df_sub).mark_circle(size=150).encode(
    x=alt.X("model_year").scale(zero=False),
    color="model_year",
    y="mpg"
)

I would say the above chart still does not look very good. Let’s see what effect changing the encoding types will have.

  • What changes if you specify different encoding types for “model_year”? (The difference in color between quantitative and ordinal will be clearer if you use a different color scheme: options.)

Here we switch the x-axis to the “Ordinal” encoding data type, using :O. Notice how the values 70, 77, 79, 81 are now treated like discrete categories, and the spacing between them is ignored.

alt.Chart(df_sub).mark_circle(size=150).encode(
    x=alt.X("model_year:O").scale(zero=False),
    color="model_year",
    y="mpg"
)

Here we change the color scheme (see the above link for options). We are specifying that the color channel should use a “Quantitative” encoding, but that is the default for this column, so you will see the same thing if you leave off the :Q. (This color scheme syntax will not work if you are on the previous version of Altair. See above or the version 4 documentation for how to make the change.)

alt.Chart(df_sub).mark_circle(size=150).encode(
    x=alt.X("model_year:O").scale(zero=False),
    color=alt.Color("model_year:Q").scale(scheme="lightgreyred"),
    y="mpg"
)

Here is the exact same chart, except that the color channel now uses the “Ordinal” encoding data type. Do you see the differences? One difference is that the “Quantitative” legend shows a continuous progression of numbers. A more subtle difference is that the colors for 77, 79, 81 are grouped closer to each other in the “Quantitative” version, whereas everything, including the 70, is equally spaced colorwise in the “Ordinal” version.

alt.Chart(df_sub).mark_circle(size=150).encode(
    x=alt.X("model_year:O").scale(zero=False),
    color=alt.Color("model_year:O").scale(scheme="lightgreyred"),
    y="mpg"
)

If you switch the color data type to “Nominal” (which means unordered) and use the default color scheme, you can see that there is no ordering or progression to the colors used. The colors in this case are chosen to make the values as distinct as possible.

alt.Chart(df_sub).mark_circle(size=150).encode(
    x=alt.X("model_year:O").scale(zero=False),
    color="model_year:N",
    y="mpg"
)

Other types of charts in Altair

Here we switch back to the full DataFrame, df. There are many types of charts in Altair (browse the example gallery to see some of the possibilities).

  • Make a bar chart using “cylinders” for the x-coordinate, using the median of the mpg values for the y-coordinate.

We haven’t gotten to the “median” part yet.

alt.Chart(df).mark_bar().encode(
    x="cylinders",
    y="mpg"
)

We specify ordinal encoding using :O. This looks better, but it still looks strange. Do you see all the little white horizontal lines? That is because the “mpg” bars are getting stacked on top of each other. Another thing to notice: the bars go up to about 6000… that is more evidence that these are being stacked on top of each other.

This kind of chart makes most sense if we perform some kind of aggregation, which we will do below using median.

alt.Chart(df).mark_bar().encode(
    x="cylinders:O",
    y="mpg"
)

There is no column called "median(mpg)" in our DataFrame. Instead this syntax is telling Altair to compute the median and plot the bar heights based on the result.

alt.Chart(df).mark_bar().encode(
    x="cylinders:O",
    y="median(mpg)"
)
  • Add a tooltip so we can find the precise median values.

For example, if you put your mouse over the 4-cylinders bar, it will report a median value of 28.25. That is telling us that the median miles-per-gallon across 4-cylinder cars in the dataset is 28.25.

alt.Chart(df).mark_bar().encode(
    x="cylinders:O",
    y="median(mpg)",
    tooltip=["median(mpg)", "cylinders"]
)
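  • Can you find these same median values using df.groupby? Use the keyword argument numeric_only when computing the median. Deepnote hides it, but without this argument pandas issues a warning here because of the non-numeric columns (and in pandas 2.0 and later it raises an error instead).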

df.groupby("cylinders").median(numeric_only=True)
             mpg  displacement  horsepower  weight  acceleration  model_year
cylinders
3          20.25          70.0        98.5  2375.0          13.5        75.0
4          28.25         105.0        78.0  2232.0          16.2        78.0
5          25.40         131.0        77.0  2950.0          19.9        79.0
6          19.00         228.0       100.0  3201.5          16.1        76.0
8          14.00         350.0       150.0  4140.0          13.0        73.0

The above result is a pandas DataFrame, so we can get a column out of that DataFrame just like always, by using square brackets and the name of the column.

Notice how that same 28.25 number is visible here.

df.groupby("cylinders").median(numeric_only=True)["mpg"]
cylinders
3    20.25
4    28.25
5    25.40
6    19.00
8    14.00
Name: mpg, dtype: float64
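An alternative ordering, sketched here on a tiny made-up table rather than on df, is to select the column before computing the median. Only the "mpg" values are aggregated in that case, so the numeric_only argument is not needed.

```python
import pandas as pd

# A tiny made-up table with some of the same column names as the mpg dataset.
toy = pd.DataFrame({
    "cylinders": [4, 4, 4, 8, 8],
    "mpg": [26.0, 28.0, 31.0, 14.0, 16.0],
    "name": ["a", "b", "c", "d", "e"],  # non-numeric column, like "name" in df
})

# Selecting the column first means the non-numeric column never reaches median().
medians = toy.groupby("cylinders")["mpg"].median()
print(medians)
```

On the real data, the analogous expression would be df.groupby("cylinders")["mpg"].median().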
  • Make a “rectangle chart” using mark_rect with “model_year” along the x-axis, with “cylinders” along the y-axis, and with the rectangles colored by "count()".

Note that here "count()" is something defined by Altair, not one of the columns in df.

Judging by the colors, the most common combination in this dataset appears to be model year 82 with 4 cylinders. The tooltip lets us check that there are 28 such cars in the dataset.

Here we are storing the chart in the variable c1 so we can refer to it below.

c1 = alt.Chart(df).mark_rect().encode(
    x="model_year:O",
    y="cylinders:O",
    color="count()",
    tooltip=["count()"]
)

c1
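The counts that Altair computes for each rectangle can also be computed in pandas, as a way to double-check what the colors and tooltip show. Here is a minimal sketch on made-up rows (not the real mpg data), grouping by both columns and counting the rows in each group.

```python
import pandas as pd

# Made-up rows using the same column names as the mpg dataset.
toy = pd.DataFrame({
    "model_year": [70, 70, 82, 82, 82],
    "cylinders": [8, 4, 4, 4, 4],
})

# size() counts the rows in each (model_year, cylinders) group, one number per rectangle.
counts = toy.groupby(["model_year", "cylinders"]).size()
print(counts)

# The most common combination and how many rows it has.
print(counts.idxmax(), counts.max())
```

Running the same groupby on df should reproduce the numbers shown in the tooltips.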
  • Make a “text chart” using mark_text with the same parameters as above, but remove the color encoding, and add a text encoding based on "count()".

By itself, this looks a little strange. (We are using text for the marks in this case, which is why the method is called mark_text.)

c2 = alt.Chart(df).mark_text().encode(
    x="model_year:O",
    y="cylinders:O",
    text="count()"
)

c2
  • Layer these last two charts together, either using + or using alt.layer.

c1+c2

The + notation in Altair is just shorthand for layering using alt.layer. Here we are getting both charts displayed at once.

alt.layer(c1, c2)

We didn’t get further than this.

Multi-view plots in Altair

A facet chart breaks the dataset up into sub-datasets and makes a different chart for each sub-dataset.

  • Make a facet chart using “horsepower” for the x-coordinate, “mpg” for the y-coordinate, “cylinders” for the color with the Nominal data encoding type, and dividing the data according to the number of cylinders. Put each chart in its own row.

Time to work on Worksheets 5-6

  • Maya and I are here to help.